FAST RATES FOR SUPPORT VECTOR MACHINES USING
GAUSSIAN KERNELS

By Ingo Steinwart and Clint Scovel 

Los Alamos National Laboratory 

For binary classification we establish learning rates up to the
order of $n^{-1}$ for support vector machines (SVMs) with hinge loss and
Gaussian RBF kernels. These rates are in terms of two assumptions 
on the considered distributions: Tsybakov's noise assumption to es- 
tablish a small estimation error, and a new geometric noise condition 
which is used to bound the approximation error. Unlike previously 
proposed concepts for bounding the approximation error, the geomet- 
ric noise assumption does not employ any smoothness assumption. 

1. Introduction. In recent years support vector machines (SVMs) have 
been the subject of many theoretical considerations. Despite this effort, their 
learning performance on restricted classes of distributions is still widely un- 
known. In particular, it is unknown under which nontrivial circumstances 
SVMs can guarantee fast learning rates. The aim of this work is to use con- 
cepts like Tsybakov's noise assumption and local Rademacher averages to 
establish learning rates up to the order of $n^{-1}$ for nontrivial distributions. In
addition to these concepts that are used to deal with the stochastic part of 
the analysis we also introduce a geometric assumption for distributions that 
allows us to estimate the approximation properties of Gaussian RBF kernels. 
Unlike many other concepts introduced for bounding the approximation er- 
ror, our geometric assumption is not in terms of smoothness but describes 
the concentration and the noisiness of the data-generating distribution near 
the decision boundary. 

Let us formally introduce the statistical classification problem. To this end 
let us fix a subset $X \subset \mathbb{R}^d$. We write $Y := \{-1, 1\}$. Given a finite training set
$T = ((x_1, y_1), \dots, (x_n, y_n)) \in (X \times Y)^n$, the classification task is to predict the
label $y$ of a new sample $(x, y)$. In the standard batch model it is assumed that
the samples $(x_i, y_i)$ are i.i.d. according to an unknown (Borel) probability
measure $P$ on $X \times Y$. Furthermore, the new sample $(x, y)$ is drawn from $P$
independently of $T$. Given a classifier $\mathcal{C}$ that assigns to every training set
$T$ a measurable function $f_T : X \to \mathbb{R}$, the prediction of $\mathcal{C}$ for $y$ is $\operatorname{sign}(f_T(x))$,
where $\operatorname{sign}(0) := 1$. The quality of such a function $f$ is measured by the
classification risk

$\mathcal{R}_P(f) := P(\{(x, y) : \operatorname{sign} f(x) \ne y\}),$

which should be as small as possible. The smallest achievable risk
$\mathcal{R}_P := \inf\{\mathcal{R}_P(f) \mid f : X \to \mathbb{R} \text{ measurable}\}$ is called the Bayes risk of $P$, and a
function attaining this risk is called a Bayes decision function and is denoted by
$f_P$. Obviously, a good classifier should at least produce decision functions
whose risks converge to the Bayes risk for all distributions P. This leads to 
the notion of universally consistent classifiers which is thoroughly treated 
in [14]. The next naturally arising question is whether there are classifiers 
which guarantee a specific convergence rate for all distributions. Unfortu- 
nately, this is impossible by a result of Devroye (see [14], Theorem 7.2). 
However, if one restricts consideration to certain smaller classes of
distributions, such "learning rates," for example, in the form of

$P^n\bigl(T \in (X \times Y)^n : \mathcal{R}_P(f_T) \le \mathcal{R}_P + C(x)\, n^{-\beta}\bigr) \ge 1 - e^{-x}, \qquad n \ge 1,\ x \ge 1,$

where $\beta > 0$ and $C(x) > 0$ are constants, exist for various classifiers. Typical
assumptions for such classes of distributions are either in terms of the
smoothness of the function $\eta(x) := P(y = 1 \mid x)$ (see, e.g., [19, 38]), or in terms
of the smoothness of the "decision boundary" (see, e.g., [18, 35]). Moreover,
the corresponding learning rates are slower than $n^{-1/2}$ if no additional
assumptions on the amount of the noise in the labels, for example, on the
distribution of the random variable

(1)    $\min\{1 - \eta(x), \eta(x)\} = \tfrac{1}{2} - \bigl|\eta(x) - \tfrac{1}{2}\bigr|$

around the critical level 1/2, are imposed. On the other hand, [35] showed
that ERM-type classifiers can learn faster than $n^{-1/2}$ if one quantifies how
likely the noise in (1) is close to 1/2 (see Definition 2.2 in the following
section). Unfortunately, the ERM classifier considered in [35] requires
substantial knowledge on how to approximate the desired Bayes decision 
functions. Moreover, ERM classifiers are based on combinatorial optimiza- 
tion problems and hence they are usually hard to implement and in general 
there exist no efficient algorithms. 

On the one hand SVMs do not share the implementation issues of ERM 
since they are based on a convex optimization (see, e.g., [12, 26] for algorithmic
aspects). On the other hand, however, their known learning rates are
rather unsatisfactory since either the assumptions on the distributions are
too restrictive as in [28] or the established learning rates are too slow as in 
[37]. Our aim is to give SVMs a better theoretical foundation by establishing 
fast learning rates for a wide class of distributions. To this end we propose a 
geometric noise assumption (see Definition 2.3) which describes the
concentration of the measure $|2\eta - 1|\,dP_X$, where $P_X$ is the marginal distribution
of $P$ with respect to $X$, near the decision boundary. This assumption is
then used to determine the approximation properties of Gaussian kernels 
which are used in the SVMs we consider. Provided that the tuning param- 
eters are optimally chosen our main result then shows that the resulting 
learning rates for these classifiers can be as fast as $n^{-1}$.

The rest of this work is organized as follows: In Section 2 we introduce 
the main concepts of this work and then present our results. In Section 3 we 
recall some basic theory on reproducing kernel Hilbert spaces and prove a 
new covering number bound for Gaussian kernels that describes a trade-off 
between the kernel widths and the radii of the covering balls. In Section 4 
we then show the approximation results that are related to our proposed 
geometric noise assumption. The last sections of the work contain the actual 
proof of our rates: In Section 5 we establish a general bound for ERM- 
type classifiers involving local Rademacher averages which is used to bound 
the estimation error in our analysis of SVMs. In order to apply this result 
we need "variance bounds" for SVMs which are established in Section 6. 
Interestingly, it turns out that sharp versions of these bounds depend on 
both Tsybakov's noise assumption and the approximation properties of the 
kernel used. Finally, we prove our learning rates in Section 7. 

2. Definitions and main results. In this section we first recall some basic 
notions related to support vector machines which are needed throughout this 
text. In Section 2.2, we then present a covering number bound for Gaussian 
RBF kernels which will play an important role in our analysis of the esti- 
mation error of SVMs. In Section 2.3 we recall Tsybakov's noise assumption 
which will allow us to establish learning rates faster than $n^{-1/2}$. Then, in
Section 2.4, we introduce the new geometric assumption that is used to esti- 
mate the approximation error for SVMs with Gaussian RBF kernels. Finally, 
we present and discuss our learning rates in Section 2.5. 

2.1. RKHSs, SVMs and basic definitions. For two functions $f$ and $g$ we
use the notation $f(\lambda) \preceq g(\lambda)$ to mean that there exists a constant $C > 0$
such that $f(\lambda) \le C g(\lambda)$ over some specified range of values of $\lambda$. We also
use the notation $\succeq$ with similar meaning and the notation $\sim$ when both $\preceq$
and $\succeq$ hold. In particular, we use the same notation for sequences.

If not stated otherwise, $X$ always denotes a compact subset of $\mathbb{R}^d$ which
is equipped with the Borel $\sigma$-algebra.

Recall (see, e.g., [1, 6]) that every positive definite kernel $k : X \times X \to \mathbb{R}$
has a unique reproducing kernel Hilbert space $H$ (RKHS) whose unit ball is
denoted by $B_H$. Although we sometimes use generic kernels and RKHSs, we
are mainly interested in Gaussian RBF kernels, which are the most widely
used kernels in practice. Recall that these kernels are of the form

$k_\sigma(x, x') = \exp(-\sigma^2 \|x - x'\|_2^2), \qquad x, x' \in X,$

where $\sigma > 0$ is a free parameter whose inverse $1/\sigma$ is called the width of $k_\sigma$.
We usually denote the corresponding RKHSs, which are thoroughly described
in [32], by $H_\sigma(X)$ or simply $H_\sigma$.
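
To make the parametrization concrete, the following minimal sketch computes the Gram matrix of $k_\sigma$ for a few points. It assumes NumPy; the function name `gaussian_kernel` and the sample points are illustrative and not part of the paper. Note that $\sigma$ enters as $\sigma^2$ in the exponent, so a larger $\sigma$ means a narrower kernel.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """k_sigma(x, x') = exp(-sigma^2 * ||x - x'||_2^2), the parametrization used in the text."""
    X1 = np.atleast_2d(X1)
    X2 = np.atleast_2d(X2)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 <a, b>
    sq = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-sigma**2 * np.maximum(sq, 0.0))

# Illustrative points (hypothetical data)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel(X, X, sigma=0.5)
```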

Let us now recall the definition of SVMs. To this end let $P$ be a distribution
on $X \times Y$ and $l : Y \times \mathbb{R} \to [0, \infty)$ be the hinge loss, that is,

$l(y, t) := \max\{0, 1 - yt\}, \qquad y \in Y,\ t \in \mathbb{R}.$

Furthermore, we define the $l$-risk of a measurable function $f : X \to \mathbb{R}$ by

$\mathcal{R}_{l,P}(f) := \mathbb{E}_{(x,y) \sim P}\, l(y, f(x)).$
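
For an empirical distribution both risks reduce to sample averages. The sketch below, which assumes NumPy and uses hypothetical function names and toy values, evaluates the hinge loss, the empirical $l$-risk and the empirical classification risk with the convention $\operatorname{sign}(0) := 1$.

```python
import numpy as np

def hinge_loss(y, t):
    """l(y, t) = max{0, 1 - y t}."""
    return np.maximum(0.0, 1.0 - y * t)

def empirical_hinge_risk(f_values, y):
    """Empirical l-risk: average hinge loss over the sample."""
    return float(np.mean(hinge_loss(y, f_values)))

def empirical_classification_risk(f_values, y):
    """Fraction of samples with sign f(x) != y, using sign(0) := 1."""
    pred = np.where(f_values >= 0, 1, -1)
    return float(np.mean(pred != y))

# Toy labels and decision values (hypothetical)
y = np.array([1, -1, 1, -1])
f_values = np.array([2.0, -0.5, 0.0, 0.3])
hinge = empirical_hinge_risk(f_values, y)
clf = empirical_classification_risk(f_values, y)
```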

Now let $H$ be an RKHS over $X$ consisting of measurable functions. For $\lambda > 0$
we denote a solution of

(2)    $\arg\min_{f \in H,\ b \in \mathbb{R}} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_{l,P}(f + b)\bigr)$

by $(f_{P,\lambda}, b_{P,\lambda})$. Recall that $f_{P,\lambda}$ is uniquely determined (see, e.g., [30]), while
in some situations this is not true for the offset $b_{P,\lambda}$. In general we thus
assume that $b_{P,\lambda}$ is an arbitrary solution. However, for the (trivial) distributions
that satisfy $P(\{y^*\} \mid x) = 1$ $P_X$-a.s. for some $y^* \in Y$ we explicitly set
$b_{P,\lambda} := y^*$ in order to control the size of the offset. Furthermore, if $P$ is an
empirical distribution with respect to a training set $T = ((x_1, y_1), \dots, (x_n, y_n))$
we write $\mathcal{R}_{l,T}(f)$ and $(f_{T,\lambda}, b_{T,\lambda})$. Note that in this case the above condition
under which we set $b_{T,\lambda} := y^*$ means that all labels $y_i$ of $T$ are equal to $y^*$.
An algorithm that constructs $(f_{T,\lambda}, b_{T,\lambda})$ for every training set $T$ is called
an SVM with offset. Furthermore, for $\lambda > 0$ we denote the unique solution
of

(3)    $\arg\min_{f \in H} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_{l,P}(f)\bigr)$

by $f_{P,\lambda}$, and for empirical distributions based on a training set $T$ we again
write $f_{T,\lambda}$. A corresponding algorithm is called an SVM without offset. Recall
that under some assumptions on the RKHS used and the choice of the 
regularization parameter A it can be shown that both SVM variants are 
universally consistent (see [29, 31, 39]); however, no satisfying learning rates 
have been established yet.

We also emphasize that in many theoretical papers only SVMs without
offset are considered since the offset often causes serious technical problems 
in the analysis. However, in practice usually SVMs with offset are used and 
therefore we feel that these algorithms should be considered in theory, too. 
As we will see, our techniques can be applied for both variants. The resulting 
rates coincide. 
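
As an illustration of the SVM without offset for an empirical distribution, the following sketch approximately minimizes $\lambda \|f\|_H^2 + \frac{1}{n}\sum_i \max\{0, 1 - y_i f(x_i)\}$ over functions $f = \sum_j \alpha_j k_\sigma(x_j, \cdot)$. It is not the algorithm of the references cited above: the solver is a plain subgradient descent in $H$ with step size $1/(2\lambda t)$, chosen here only for brevity, and the data, the names `gaussian_gram` and `svm_without_offset`, and all parameter values are assumptions made for the example.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix of k_sigma(x, x') = exp(-sigma^2 ||x - x'||^2)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-sigma**2 * sq)

def svm_without_offset(K, y, lam, n_steps=200):
    """Subgradient descent on lam*||f||_H^2 + (1/n) sum_i max(0, 1 - y_i f(x_i)),
    with f = sum_j alpha_j k(x_j, .); returns the coefficient vector alpha."""
    n = len(y)
    alpha = np.zeros(n)
    for t in range(1, n_steps + 1):
        margins = y * (K @ alpha)
        active = (margins < 1.0).astype(float)
        # Coefficients of a subgradient in H: 2*lam*alpha - (1/n) * y_i * 1[margin_i < 1]
        grad = 2.0 * lam * alpha - (y * active) / n
        alpha -= grad / (2.0 * lam * t)
    return alpha

# Two well-separated toy clusters (hypothetical data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.2, size=(20, 2)), rng.normal(-2.0, 0.2, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
K = gaussian_gram(X, sigma=1.0)
alpha = svm_without_offset(K, y, lam=0.01)
train_error = float(np.mean(np.where(K @ alpha >= 0, 1.0, -1.0) != y))
```

An SVM with offset would additionally optimize a scalar $b$ added to $f$; that variant is omitted here.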

2.2. Covering numbers for Gaussian RKHSs. In order to bound the es- 
timation error of SVMs we need a complexity measure for the RKHSs used, 
which is introduced in this section. To this end let $A \subset E$ be a subset of a
Banach space $E$. The covering numbers of $A$ are defined by

$\mathcal{N}(A, \varepsilon, E) := \min\Bigl\{ n \ge 1 : \exists\, x_1, \dots, x_n \in E \text{ with } A \subset \bigcup_{i=1}^n (x_i + \varepsilon B_E) \Bigr\},$

$\varepsilon > 0$, where $B_E$ denotes the closed unit ball of $E$. Moreover, for a bounded
linear operator $S : E \to F$ between two Banach spaces $E$ and $F$, the covering
numbers are $\mathcal{N}(S, \varepsilon) := \mathcal{N}(S B_E, \varepsilon, F)$.

Given a training set $T = ((x_1, y_1), \dots, (x_n, y_n)) \in (X \times Y)^n$ we denote the
space of all equivalence classes of functions $f : X \times Y \to \mathbb{R}$ with norm

(4)    $\|f\|_{L_2(T)} := \Bigl( \frac{1}{n} \sum_{i=1}^n |f(x_i, y_i)|^2 \Bigr)^{1/2}$

by $L_2(T)$. In other words, $L_2(T)$ is an $L_2$-space with respect to the empirical
measure of $T$. Note that for a function $f : X \times Y \to \mathbb{R}$ a canonical
representative in $L_2(T)$ is its restriction $f|_T$ to $T$. In addition, $L_2(T_X)$ denotes the space
of all (equivalence classes of) square integrable functions with respect to the
empirical measure of $x_1, \dots, x_n$.

The proof of our learning rates uses the behavior of $\mathcal{N}(B_{H_\sigma(X)}, \varepsilon, L_2(T_X))$
in $\varepsilon$ and $\sigma$ in order to bound the estimation error. Unfortunately, all known
results on covering numbers for Gaussian RBF kernels emphasize the role of
$\varepsilon$ and hence we will establish in Section 3 the following result which describes
a suitable trade-off between the influence of $\varepsilon$ and $\sigma$.

Theorem 2.1. Let $\sigma \ge 1$, $X \subset \mathbb{R}^d$ be a compact subset with nonempty
interior, and $H_\sigma(X)$ be the RKHS of the Gaussian RBF kernel $k_\sigma$ on $X$.
Then for all $0 < p \le 2$ and all $\delta > 0$, there exists a constant $c_{p,\delta,d} > 0$
independent of $\sigma$ such that for all $\varepsilon > 0$ we have

$\sup_{T \in (X \times Y)^n} \log \mathcal{N}(B_{H_\sigma(X)}, \varepsilon, L_2(T_X)) \le c_{p,\delta,d}\, \sigma^{(1 - p/2)(1 + \delta)d}\, \varepsilon^{-p}.$

2.3. Tsybakov's noise assumption. Now we recall Tsybakov's noise
condition, which describes the amount of noise in the labels. In order to motivate
Tsybakov's assumption let us first observe that by equation (1) the function
$|2\eta - 1|$ can be used to describe the noise in the labels of a distribution
$P$. Indeed, in regions where this function is close to 1 there is only a small
amount of noise, whereas function values close to 0 only occur in regions with
a high level of noise. The following definition, in which we use the convention
$t^\infty := 0$ for $t \in (0, 1)$, describes the size of the latter regions:

Definition 2.2. Let $0 \le q \le \infty$ and $P$ be a probability measure on
$X \times Y$. We say that $P$ has Tsybakov noise exponent $q$ if there exists a
constant $C > 0$ such that for all sufficiently small $t > 0$ we have

(5)    $P_X(\{x \in X : |2\eta(x) - 1| \le t\}) \le C \cdot t^q.$

Obviously, $P$ has Tsybakov noise exponent $q > 0$ if and only if $|2\eta - 1|^{-1} \in
L_{q,\infty}(P_X)$, where $L_{q,\infty}$ denotes a Lorentz space (see [5]). It is also easy to see
that $P$ has Tsybakov noise exponent $q'$ for all $q' < q$ if $P$ has Tsybakov noise
exponent $q$. Furthermore, all distributions obviously have noise exponent 0.
In the other extreme case $q = \infty$ the conditional probability $\eta$ is bounded
away from 1/2. In particular, noise-free distributions have exponent $q = \infty$.
Furthermore, for $q < \infty$ it is easy to check that Definition 2.2 is satisfied
if and only if (5) holds for all $t > 0$ and a possibly different constant $C$.
Finally, note that (5) does not make any assumptions on the location of the 
noisy set, and hence we prefer the notion "noise condition" rather than the 
often used term "margin condition." 
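
Condition (5) can be checked numerically for simple distributions. The sketch below uses the one-dimensional example that reappears in Remark 2.12: $P_X$ uniform on $[-1, 1]$ and $|2\eta(x) - 1| = |x|^\gamma$, for which $P_X(|2\eta - 1| \le t) = t^{1/\gamma}$ and hence $q = 1/\gamma$. A fine grid stands in for the uniform marginal; the helper name `noise_set_mass`, the grid size and the chosen values of $t$ are assumptions of this illustration.

```python
import numpy as np

def noise_set_mass(abs_2eta_minus_1, t):
    """Empirical version of P_X({x : |2 eta(x) - 1| <= t})."""
    return float(np.mean(abs_2eta_minus_1 <= t))

gamma = 2.0
x = np.linspace(-1.0, 1.0, 200001)   # grid standing in for the uniform marginal on [-1, 1]
noise = np.abs(x) ** gamma           # |2 eta(x) - 1| = |x|^gamma
ts = np.array([0.01, 0.04, 0.16])
masses = np.array([noise_set_mass(noise, t) for t in ts])

# Slope of log(mass) against log(t) estimates the Tsybakov exponent q
slope = np.polyfit(np.log(ts), np.log(masses), 1)[0]
```

The fitted slope should be close to $1/\gamma$, in line with the exponent stated for this example.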

2.4. A new geometric assumption for distributions. In this section we 
introduce a condition for distributions that will allow us to estimate the 
approximation error for Gaussian RBF kernels. To this end let $l$ be the
hinge loss function and $P$ be a distribution on $X \times Y$. Let

$\mathcal{R}_{l,P} := \inf\{\mathcal{R}_{l,P}(f) \mid f : X \to \mathbb{R} \text{ measurable}\}$

denote the smallest possible $l$-risk of $P$. Since functions achieving the
minimal $l$-risk occur in many situations we indicate them by $f_{l,P}$ if no confusion
regarding the nonuniqueness of this symbol can be expected. Furthermore,
recall that $f_{l,P}$ has a shape similar to the Bayes decision function $\operatorname{sign} f_P$
(see, e.g., [30]). Now, given an RKHS $H$ over $X$ we define the approximation
error function with respect to $H$ and $P$ by

(6)    $a(\lambda) := \inf_{f \in H} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_{l,P}(f) - \mathcal{R}_{l,P}\bigr), \qquad \lambda \ge 0.$

Note that the obvious analogue of the approximation error function with 
offset is not greater than the above approximation error function without 
offset and hence we restrict our attention to the latter for simplicity.

For $\lambda > 0$, the approximation error function describes how well $\lambda \|f_{P,\lambda}\|_H^2 +
\mathcal{R}_{l,P}(f_{P,\lambda})$ approximates $\mathcal{R}_{l,P}$. For example, it was shown in [31] that we
have $\lim_{\lambda \to 0} a(\lambda) = 0$ for all $P$ if $X$ is a compact metric space and $H$ is
dense in the space of continuous functions $C(X)$. However, in nontrivial
situations there cannot exist a convergence rate which holds uniformly for
all distributions $P$. Since $H_\sigma(X)$ is dense in $C(X)$ for compact $X \subset \mathbb{R}^d$ and
all $\sigma > 0$ these statements are in particular true for the approximation error
functions $a_\sigma(\cdot)$ of the Gaussian RBF kernels with fixed width $1/\sigma$. Moreover,
we are not aware of any weak condition on $\eta$ or $P$ that ensures $a_\sigma(\lambda) \preceq \lambda^\beta$
for $\lambda \to 0$ and some $\beta > 0$, and the results of [27] indicate that such behavior
of $a_\sigma(\cdot)$ may actually require very restrictive conditions. In the following we
will therefore present a condition on $P$ that allows us to estimate $a_\sigma(\lambda)$ by
$\lambda$ and $\sigma$. In particular it will turn out that $a_\sigma(\lambda) \to 0$ with a polynomial
rate in $\lambda$ if we relate $\sigma$ to $\lambda$ in a certain manner. In order to introduce this
assumption on $P$ we first define the classes of $P$ by $X_{-1} := \{x \in X : \eta(x) < \tfrac{1}{2}\}$,
$X_1 := \{x \in X : \eta(x) > \tfrac{1}{2}\}$ and $X_0 := \{x \in X : \eta(x) = \tfrac{1}{2}\}$ for some choice
of $\eta$. Now we define a distance function $x \mapsto \tau_x$ by

(7)    $\tau_x := \begin{cases} d(x, X_0 \cup X_1), & \text{if } x \in X_{-1}, \\ d(x, X_0 \cup X_{-1}), & \text{if } x \in X_1, \\ 0, & \text{otherwise,} \end{cases}$

where $d(x, A)$ denotes the distance of $x$ to a set $A$ with respect to the
Euclidean norm. Roughly speaking, $\tau_x$ measures the distance of $x$ to the
"decision boundary." Now we can present the already announced geometric
condition for distributions.

Definition 2.3. Let $X \subset \mathbb{R}^d$ be compact and $P$ be a probability
measure on $X \times Y$. We say that $P$ has geometric noise exponent $\alpha > 0$ if there
exists a constant $C > 0$ such that

(8)    $\int_X |2\eta(x) - 1| \exp\Bigl(-\frac{\tau_x^2}{t}\Bigr) P_X(dx) \le C t^{\alpha d/2}, \qquad t > 0.$

We say that $P$ has geometric noise exponent $\infty$ if it has geometric noise
exponent $\alpha$ for all $\alpha > 0$.

Note that in the above definition we neither make any kind of smoothness
assumption nor do we assume a condition on $P_X$ in terms of absolute
continuity with respect to the Lebesgue measure. Instead, the integral condition
(8) describes the concentration of the measure $|2\eta - 1|\,dP_X$ near the
decision boundary in the sense that the less the measure is concentrated in this
region the larger the geometric noise exponent can be chosen. The following 
example illustrates this.

Example 2.4. Since $\exp(-t) \le C_\alpha t^{-\alpha}$ holds for all $t > 0$ and a constant
$C_\alpha > 0$ only depending on $\alpha > 0$, we easily see that (8) is satisfied whenever

(9)    $(x \mapsto \tau_x^{-1}) \in L_{\alpha d}(|2\eta - 1|\,dP_X),$

where $L_{\alpha d}(|2\eta - 1|\,dP_X)$ denotes the usual Lebesgue space of functions that
are $\alpha d$-integrable with respect to the measure $|2\eta - 1|\,dP_X$. Now, let us
suppose $X_0 = \emptyset$ for a moment. In this case $\tau_x$ measures the distance to the
class $x$ does not belong to. In particular, (9) holds for $\alpha = \infty$ if and only if
the two classes $X_{-1}$ and $X_1$ have strictly positive distance. Moreover, if (9)
holds for some $0 < \alpha < \infty$ the two classes may "touch," that is, the decision
boundary $\partial X_{-1} \cap \partial X_1$ is nonempty. Consequently, we can easily construct
distributions $P$ that have geometric noise exponent $\infty$ and touching classes,
but also satisfy $f_P \notin H_\sigma(X)$ for all $\sigma > 0$. However, note that for such $P$ the
measure $|2\eta - 1|\,dP_X$ must obviously have a very low concentration near the
decision boundary.

We now describe a simple regularity condition on $\eta$ near the decision
boundary that can be used to guarantee a geometric noise exponent.

Definition 2.5. Let $X \subset \mathbb{R}^d$, $P$ be a distribution on $X \times Y$ and $\gamma > 0$.
We say that $P$ has an envelope of order $\gamma$ if there is a constant $c_\gamma > 0$ such
that for $P_X$-almost all $x \in X$ we have

(10)    $|2\eta(x) - 1| \le c_\gamma \tau_x^\gamma.$

Obviously, if $P$ has an envelope of order $\gamma$ then the graph of $x \mapsto 2\eta(x) - 1$
lies in a multiple of the envelope defined by $\tau_x^\gamma$ at the top and by $-\tau_x^\gamma$ at
the bottom. Consequently, $\eta$ can be very irregular away from the decision
boundary but cannot be discontinuous when crossing it. The rate of
convergence of $\eta(x) \to 1/2$ for $\tau_x \to 0$ is described by $\gamma$.

Interestingly, for distributions having both an envelope of order $\gamma$ and a
Tsybakov noise exponent $q$ we can bound the geometric noise exponent, as
the following theorem, which is proved in Section 4, shows.

Theorem 2.6. Let $X \subset \mathbb{R}^d$ be compact and $P$ be a distribution on $X \times Y$
that has an envelope of order $\gamma > 0$ and a Tsybakov noise exponent $q \in [0, \infty)$.
Then $P$ has geometric noise exponent $\frac{q+1}{d}\gamma$ if $q \ge 1$, and geometric
noise exponent $\alpha$ for all $\alpha < \frac{q+1}{d}\gamma$ otherwise.

Now the main result of this subsection, which is proved in Section 4, shows
that for distributions having a nontrivial geometric noise exponent we can
bound the approximation error function for Gaussian RBF kernels.

Theorem 2.7. Let $\sigma > 0$, $X$ be the closed unit ball of the Euclidean
space $\mathbb{R}^d$ and $a_\sigma(\cdot)$ be the approximation error function with respect to
$H_\sigma(X)$. Furthermore, let $P$ be a distribution on $X \times Y$ that has geometric
noise exponent $0 < \alpha < \infty$ with constant $C$ in (8). Then there is a constant
$c_d > 0$ depending only on the dimension $d$ such that for all $\lambda > 0$ we have

(11)    $a_\sigma(\lambda) \le c_d \bigl(\sigma^d \lambda + C (2d)^{\alpha d/2} \sigma^{-\alpha d}\bigr).$

In order to let the right-hand side of (11) converge to zero it is necessary to
assume both $\lambda \to 0$ and $\sigma \to \infty$. An easy consideration shows that the fastest
convergence rate is achieved if $\sigma(\lambda) := \lambda^{-1/((\alpha+1)d)}$. In this case we have
$a_{\sigma(\lambda)}(\lambda) \preceq \lambda^{\alpha/(\alpha+1)}$. In particular, we can obtain rates up to linear order in
$\lambda$ for sufficiently benign distributions. The price for this good approximation
property is, however, an increasing complexity of the hypothesis class $B_{H_{\sigma(\lambda)}}$,
as we have seen in Theorem 2.1.
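
The choice of $\sigma(\lambda)$ can be verified by balancing the two terms on the right-hand side of (11). The following short computation is a sketch that ignores the constants $c_d$, $C$ and $(2d)^{\alpha d/2}$, which do not affect the exponents.

```latex
\begin{align*}
\sigma^{d}\lambda = \sigma^{-\alpha d}
  &\iff \sigma^{(\alpha+1)d} = \lambda^{-1}
  \iff \sigma = \lambda^{-\frac{1}{(\alpha+1)d}}, \\
\sigma(\lambda)^{d}\,\lambda
  &= \lambda^{-\frac{1}{\alpha+1}}\,\lambda
  = \lambda^{\frac{\alpha}{\alpha+1}},
\qquad
\sigma(\lambda)^{-\alpha d}
  = \lambda^{\frac{\alpha}{\alpha+1}} .
\end{align*}
```

Both terms are then of order $\lambda^{\alpha/(\alpha+1)}$, which tends to linear order in $\lambda$ as $\alpha \to \infty$.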



2.5. Learning rates for SVMs using Gaussian RBF kernels. With the
help of the geometric noise assumption we can now present our learning rates
for SVMs using Gaussian RBF kernels. Note again that these polynomial
rates do not require a smoothness assumption on $P$. Furthermore note that
we use the convention $\frac{a \cdot \infty + b}{c \cdot \infty + d} := \frac{a}{c}$ for $a, c \in (0, \infty)$, $b, d \in [0, \infty)$ in order to
make the presentation compact.

Theorem 2.8. Let $X$ be the closed unit ball of $\mathbb{R}^d$, and $P$ be a distribution
on $X \times Y$ with Tsybakov noise exponent $q \in [0, \infty]$ and geometric noise
exponent $\alpha \in (0, \infty)$. We define

$\beta := \begin{cases} \dfrac{\alpha}{2\alpha + 1}, & \text{if } \alpha \le \dfrac{q + 2}{2q}, \\[2ex] \dfrac{2\alpha(q + 1)}{2\alpha(q + 2) + 3q + 4}, & \text{otherwise,} \end{cases}$

and $\lambda_n := n^{-\frac{\alpha+1}{\alpha}\beta}$ and $\sigma_n := n^{\beta/(\alpha d)}$ in both cases. Then for all $\varepsilon > 0$
there exists a $C > 0$ such that for all $x \ge 1$ and $n \ge 1$ the SVM without
offset using the Gaussian RBF kernel $k_{\sigma_n}$ satisfies

$\Pr{}^*\bigl(T \in (X \times Y)^n : \mathcal{R}_P(f_{T,\lambda_n}) \le \mathcal{R}_P + C x^2 n^{-\beta + \varepsilon}\bigr) \ge 1 - e^{-x},$

where $\Pr^*$ denotes the outer probability of $P^n$ in order to avoid measurability
considerations. If $\alpha = \infty$ the latter inequality holds if $\sigma_n = \sigma$ is a constant
with $\sigma > 2\sqrt{d}$. Finally, all results also hold for the SVM with offset.
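
The exponent $\beta$ is easy to tabulate. The sketch below evaluates the two-case definition with exact rational arithmetic; the function name `beta` is ours, and the handling of $q = \infty$ reflects our reading of the convention for quotients (leading coefficients in $q$), which should be treated as an assumption. For $q = 0$ the threshold $(q+2)/(2q)$ is read as $+\infty$, so the first case always applies.

```python
from fractions import Fraction
import math

def beta(alpha, q):
    """Exponent beta of Theorem 2.8 for finite alpha > 0 and q in [0, inf]."""
    if q == math.inf:
        # Assumed convention: (a*inf + b)/(c*inf + d) := a/c, so the threshold is 1/2
        if alpha <= Fraction(1, 2):
            return alpha / (2 * alpha + 1)
        return (2 * alpha) / (2 * alpha + 3)
    if q == 0 or alpha <= Fraction(q + 2) / (2 * q):
        return alpha / (2 * alpha + 1)
    return 2 * alpha * (q + 1) / (2 * alpha * (q + 2) + 3 * q + 4)

# Values at the thresholds alpha = (3q + 4)/(2q) for q = 1 and q = inf
b_q1 = beta(Fraction(7, 2), 1)
b_qinf = beta(Fraction(3, 2), math.inf)
```

Both values equal $1/2$, so larger $\alpha$ gives rates faster than $n^{-1/2}$, and the two branches agree at the switching point $\alpha = (q+2)/(2q)$.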

Remark 2.9. The above learning rates are faster than the "parametric"
rate $n^{-1/2}$ if and only if $\alpha > (3q + 4)/(2q)$. For $q = \infty$ the latter condition
becomes $\alpha > 3/2$ and in an "intermediate" case $q = 1$ it becomes $\alpha > 7/2$.

Remark 2.10. It is important to note that our techniques can also be
used to establish rates for other definitions of the sequences $(\lambda_n)$ and $(\sigma_n)$.
In fact, Theorem 2.7 guarantees $a_{\sigma_n}(\lambda_n) \to 0$ (which is necessary for our
techniques to produce any rate) if $\sigma_n \to \infty$ and $\sigma_n^d \lambda_n \to 0$. In particular, if
$\lambda_n := n^{-\iota}$ and $\sigma_n := n^{\kappa}$ for some $\iota, \kappa > 0$ with $\kappa d < \iota$, these conditions are
satisfied and a conceptually easy but technically involved modification of
our proof can produce rates for certain ranges of $\iota$ (and thus $\kappa$). In order to
keep the presentation as short as possible we have omitted the details and
focused on the best possible rates.

Remark 2.11. Unfortunately, the choice of $\lambda_n$ and $\sigma_n$ that yields the
optimal rates within our techniques requires knowledge of the values of $\alpha$ and
$q$, which are typically not available. Adaptive methods which do not require
such knowledge are still unknown.

Remark 2.12. Theorem 2.7 and Theorem 2.8 establish results for all 
distributions having some geometric noise exponent. However, for certain 
distributions of this type the resulting rates are not satisfactory. For
example, consider the distribution $P$ on $X := [-1, 1]$ whose marginal
distribution $P_X$ equals the uniform distribution and whose conditional
distribution $\eta(x) := P(y = 1 \mid x)$ satisfies $|2\eta(x) - 1| = |x|^\gamma$, $x \in X$, for some constant
$\gamma \in (0, \infty)$. Then $P$ obviously has Tsybakov noise exponent $q := 1/\gamma$, and
Theorem 2.6 or a simple modification of the proof of Theorem 2.7 shows
that $P$ has geometric noise exponent $\alpha := 1 + \gamma$. Theorem 2.8 thus gives a
rate of the form $n^{-\beta + \varepsilon}$ for $\beta = \frac{2(\gamma + 1)^2}{4\gamma^2 + 10\gamma + 5}$, which is never faster than $n^{-1/2}$.
Though this is disappointing at first glance, it is not really surprising since 
the proof of Theorem 2.7 is not tailored to distributions having such simple 
decision functions. We believe that sharper bounds on the approximation er- 
ror function (and thus faster learning rates) for this and other distributions 
are possible, but a detailed analysis is beyond the scope of this paper. 
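
The geometric noise exponent claimed in Remark 2.12 can be checked numerically against condition (8). The sketch below assumes that the two classes of the example are separated at the origin, so that $\tau_x = |x|$, and approximates the integral in (8) by a Riemann sum; the function name, grid size and values of $t$ are illustrative choices. With $d = 1$ the integral should scale like $t^{\alpha d/2} = t^{(1+\gamma)/2}$ for small $t$.

```python
import numpy as np

def geometric_noise_integral(t, gamma, n_grid=2_000_001):
    """Riemann-sum approximation of  int |2 eta(x) - 1| exp(-tau_x^2 / t) P_X(dx)
    for P_X uniform on [-1, 1], |2 eta(x) - 1| = |x|^gamma and tau_x = |x| (assumed)."""
    x = np.linspace(-1.0, 1.0, n_grid)
    dx = x[1] - x[0]
    density = 0.5   # density of the uniform distribution on [-1, 1]
    return float(np.sum(np.abs(x) ** gamma * np.exp(-x**2 / t)) * dx * density)

gamma = 1.0
ts = np.array([1e-4, 1e-3, 1e-2])
vals = np.array([geometric_noise_integral(t, gamma) for t in ts])

# Slope of log(integral) against log(t) estimates alpha * d / 2
slope = np.polyfit(np.log(ts), np.log(vals), 1)[0]
```

For $\gamma = 1$ the integral equals $\tfrac{t}{2}(1 - e^{-1/t})$ in closed form, so the fitted slope should be close to $(1 + \gamma)/2 = 1$.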

Remark 2.13. Another interesting but open question is whether the 
obtained rates are optimal for the class of considered distributions. In order 
to approach this question let us consider the case $\alpha = \infty$, which roughly
speaking describes the case of almost no approximation error. In this case our
rates are essentially of the form $n^{-(q+1)/(q+2)}$, which coincides with the rates
Tsybakov (see [35]) achieved for certain ERM classifiers based on hypothesis 
classes of small complexity. The latter rates in turn cannot be improved in a 
minimax sense for certain classes of distributions as was also shown in [35].
This discussion indicates that the techniques used for the stochastic part of 
our analysis may be strong enough to produce optimal results. However, if 
we consider the case $\alpha < \infty$ then the approximation error function described
in Theorem 2.7 and its influence on the estimation error (see our proofs, in
particular Section 5 and Section 7) have a significant impact on the obtained 
rates. Since the sharpness of Theorem 2.7 is unclear to us we make no 
conjecture regarding the optimality of our rates in the general case. 

3. Proof of Theorem 2.1. The main goal of this section is to prove The- 
orem 2.1, which is done in Section 3.2. To this end we provide in Section 3.1 
some RKHS theory which is used throughout this work. 

3.1. Some basic RKHS theory. For the proofs of this section we have to 
recall some basic facts from the theory of RKHSs. To this end let $X \subset \mathbb{R}^d$
be a compact subset and $k : X \times X \to \mathbb{R}$ be a continuous and positive
semidefinite kernel with RKHS $H$. Then $H$ consists of continuous functions on
$X$ and for $f \in H$ we have $\|f\|_\infty \le K \|f\|_H$, where

(12)    $K := \sup_{x \in X} \sqrt{k(x, x)}.$

Consequently, if the embedding of the RKHS $H$ into the space of continuous
functions $C(X)$ is denoted by

(13)    $J_H : H \to C(X)$

we have $\|J_H\| \le K$. Furthermore, let us recall the representation of $H$ based
on Mercer's theorem (see [13]). To this end let $K_X : L_2(X) \to L_2(X)$ be the
integral operator defined by

(14)    $K_X f(x) := \int_X k(x, x') f(x')\,dx', \qquad f \in L_2(X),\ x \in X,$

where $L_2(X)$ denotes the $L_2$-space on $X$ with respect to the Lebesgue
measure. Then it was shown in [13] that the unique square root $K_X^{1/2}$ of $K_X$ is
an isometric isomorphism between $L_2(X)$ and $H$.
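
The bound $\|f\|_\infty \le K \|f\|_H$ can be illustrated for finite kernel expansions $f = \sum_i a_i k_\sigma(x_i, \cdot)$, whose RKHS norm satisfies $\|f\|_H^2 = a^\top G a$ with the Gram matrix $G$. The sketch below takes $X = [-1, 1]$ and the Gaussian kernel, for which $K = 1$; the random centers and coefficients, and the grid used to approximate the supremum, are assumptions of the example.

```python
import numpy as np

def k_sigma(a, b, sigma):
    """Gaussian kernel matrix between two 1-D point sets."""
    return np.exp(-sigma**2 * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(1)
sigma = 3.0
centers = rng.uniform(-1.0, 1.0, size=15)
coeffs = rng.normal(size=15)

# f = sum_i coeffs_i * k_sigma(centers_i, .) has ||f||_H^2 = coeffs^T G coeffs
G = k_sigma(centers, centers, sigma)
rkhs_norm = float(np.sqrt(coeffs @ G @ coeffs))

# Approximate the sup norm of f on X = [-1, 1] by a maximum over a fine grid
grid = np.linspace(-1.0, 1.0, 2001)
sup_norm = float(np.max(np.abs(k_sigma(grid, centers, sigma) @ coeffs)))
K_const = 1.0   # sup_x sqrt(k_sigma(x, x)) = 1 for the Gaussian kernel
```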

3.2. Proof of Theorem 2.1. In order to prove Theorem 2.1 we need the
following result which bounds the covering numbers of $H_\sigma(X)$ with respect
to $C(X)$.

Theorem 3.1. Let $\sigma \ge 1$, $0 < p < 2$ and $X \subset \mathbb{R}^d$ be a compact subset
with nonempty interior. Then there is a constant $c_{p,d} > 0$ independent of $\sigma$
such that for all $\varepsilon > 0$ we have

$\log \mathcal{N}(B_{H_\sigma(X)}, \varepsilon, C(X)) \le c_{p,d}\, \sigma^{(1 - p/4)d}\, \varepsilon^{-p}.$

Proof. Let $B_d$ be the closed unit ball of the Euclidean space $\mathbb{R}^d$ and
$\mathring{B}_d$ be its interior. Then there exists an $r \ge 1$ such that $X \subset r B_d$. Now,
it was recently shown in [32] that the restrictions $H_\sigma(r B_d) \to H_\sigma(X)$ and
$H_\sigma(r B_d) \to H_\sigma(\mathring{B}_d)$ are both isometric isomorphisms. Consequently, in the
following we assume without loss of generality that $X = B_d$ or $X = \mathring{B}_d$ and
do not concern ourselves with the distinction of both cases.

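Now let us write $H_\sigma := H_\sigma(X)$ and $J_\sigma := J_{H_\sigma} : H_\sigma \to C(X)$ in order to
simplify notation. Furthermore, let $K_\sigma : L_2(X) \to L_2(X)$ be the integral
operator of $k_\sigma$ defined as in (14), and $\|\cdot\|$ denote the norm in $L_2(X)$. According
to [13], Theorem 3, page 27, for any $f \in H_\sigma$ we obtain

$\inf_{\|K_\sigma^{-1} h\| \le R} \|f - h\| \le \frac{1}{R} \|K_\sigma^{-1/2} f\|^2 = \frac{1}{R} \|f\|_{H_\sigma}^2,$

where we use the convention $\|K_\sigma^{-1} h\| = \infty$ if $h \notin K_\sigma L_2(X)$. Suppose now
that $\mathcal{H} \subset L_2(X)$ is a dense Hilbert space with $\|h\| \le \|h\|_{\mathcal{H}}$, and that we
have $K_\sigma : L_2(X) \to \mathcal{H} \subset L_2(X)$ with $\|K_\sigma : L_2(X) \to \mathcal{H}\| \le c_{\sigma,\mathcal{H}} < \infty$ for some
constant $c_{\sigma,\mathcal{H}} > 0$. It follows that

$\inf_{\|h\|_{\mathcal{H}} \le c_{\sigma,\mathcal{H}} R} \|f - h\| \le \inf_{\|K_\sigma^{-1} h\| \le R} \|f - h\| \le \frac{1}{R} \|f\|_{H_\sigma}^2$

and hence

$\inf_{\|h\|_{\mathcal{H}} \le R} \|f - h\| \le \frac{c_{\sigma,\mathcal{H}}}{R} \|f\|_{H_\sigma}^2.$

By [27], Theorem 3.1 it follows that $f$ is contained in the real interpolation
space $(L_2(X), \mathcal{H})_{1/2,\infty}$ (see [7] for the definition of an interpolation space)
and its norm in this space satisfies $\|f\|_{1/2,\infty} \le 2\sqrt{c_{\sigma,\mathcal{H}}}\, \|f\|_{H_\sigma}$. Therefore we
obtain a continuous embedding

$T_1 : H_\sigma \to (L_2(X), \mathcal{H})_{1/2,\infty}$

with $\|T_1\| \le 2\sqrt{c_{\sigma,\mathcal{H}}}$. If in addition a subset inclusion $(L_2(X), \mathcal{H})_{1/2,\infty} \subset
C(X)$ exists which defines a continuous embedding

$T_2 : (L_2(X), \mathcal{H})_{1/2,\infty} \to C(X),$

we have a factorization $J_\sigma = T_2 T_1$ and can conclude

(15)    $\log \mathcal{N}(B_{H_\sigma(X)}, \varepsilon, C(X)) = \log \mathcal{N}(J_\sigma, \varepsilon) \le \log \mathcal{N}\Bigl(T_2, \frac{\varepsilon}{2\sqrt{c_{\sigma,\mathcal{H}}}}\Bigr).$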
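Consequently, to bound $\log \mathcal{N}(J_\sigma, \varepsilon)$ we need to select an $\mathcal{H}$, compute $c_{\sigma,\mathcal{H}}$
and bound $\log \mathcal{N}(T_2, \varepsilon)$. To that end let $\mathcal{H} := W^m(X)$ be the Sobolev space
with norm

$\|f\|_m^2 := \sum_{|\alpha| \le m} \|D^\alpha f\|^2,$

where $|\alpha| := \sum_{i=1}^d \alpha_i$, $D^\alpha := \prod_{i=1}^d \partial_i^{\alpha_i}$, and $\partial_i^{\alpha_i}$ denotes the $\alpha_i$th partial
derivative in the $i$th coordinate of $\mathbb{R}^d$. By the Cauchy-Schwarz inequality
we obtain

(16)    $\|D^\alpha K_\sigma f\|^2 \le \|f\|^2 \int_X \int_X |D_x^\alpha k_\sigma(x, x')|^2\,dx\,dx',$

where the notation $D_x^\alpha$ indicates that the differentiation takes place in the
$x$ variable. To address the term $D_x^\alpha k_\sigma(x, x')$ we note that

$D^\alpha(e^{-|x|^2}) = (-1)^{|\alpha|} e^{-|x|^2/2} h_\alpha(x),$

where the multivariate Hermite functions $h_\alpha(x) = \prod_{i=1}^d h_{\alpha_i}(x_i)$ are products
of the univariate functions. Since $\int_{\mathbb{R}} h_k^2(x)\,dx = 2^k k! \sqrt{\pi}$ (see, e.g., [11]) we
obtain

(17)    $\int_{\mathbb{R}^d} |D^\alpha(e^{-|x|^2})|^2\,dx = \int_{\mathbb{R}^d} e^{-|x|^2} h_\alpha^2(x)\,dx \le \int_{\mathbb{R}^d} h_\alpha^2(x)\,dx = 2^{|\alpha|} \alpha!\, \pi^{d/2},$

where we have used the definition $\alpha! := \prod_{i=1}^d \alpha_i!$. Applying the translation
invariance of $k_\sigma$, we obtain

$\int_{\mathbb{R}^d} |D_x^\alpha k_\sigma(x, x')|^2\,dx = \int_{\mathbb{R}^d} |D_x^\alpha k_\sigma(0, x)|^2\,dx = \int_{\mathbb{R}^d} |D_x^\alpha(e^{-\sigma^2 |x|^2})|^2\,dx,$

and by a change of variables we can apply inequality (17) to the integral on
the right-hand side,

$\int_{\mathbb{R}^d} |D_x^\alpha(e^{-\sigma^2 |x|^2})|^2\,dx = \sigma^{2|\alpha| - d} \int_{\mathbb{R}^d} |D_x^\alpha(e^{-|x|^2})|^2\,dx \le \sigma^{2|\alpha| - d}\, 2^{|\alpha|} \alpha!\, \pi^{d/2}.$

Hence we obtain

$\int_X \int_X |D_x^\alpha k_\sigma(x, x')|^2\,dx\,dx' \le \theta(d)\, \sigma^{2|\alpha| - d}\, 2^{|\alpha|} \alpha!\, \pi^{d/2},$

where $\theta(d)$ is the volume of $X$. Since $\sum_{|\alpha| \le m} \alpha! \le d^m m!$ and $\|K_\sigma f\|_m^2 =
\sum_{|\alpha| \le m} \|D^\alpha K_\sigma f\|^2$ we can therefore infer from (16) that for $\sigma \ge 1$ we have

(18)    $\|K_\sigma\| \le \sqrt{\theta(d)}\, (2d)^{m/2} \sqrt{m!}\, \pi^{d/4}\, \sigma^{m - d/2}.$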

Now let us consider $T_2 : (L_2(X), W^m(X))_{1/2,\infty} \to C(X)$. According to
Triebel [34], page 267, we have

$(L_2(X), W^m(X))_{1/2,\infty} = B^{m/2}_{2,\infty}(X)$

isomorphically. Furthermore

(19)    $\log \mathcal{N}(B^{m/2}_{2,\infty}(X) \to C(X), \varepsilon) \le c_{m,d}\, \varepsilon^{-2d/m}$

for $m > d$ follows from a similar result of Birman and Solomjak ([8], cf. also
[34]) for Slobodeckij (i.e., fractional Sobolev) spaces, where the constant $c_{m,d}$
depends only on $m$ and $d$. Consequently we obtain from (15), (18) and (19)
that

$\log \mathcal{N}(J_\sigma, \varepsilon) \le c_{m,d} \Bigl(\frac{\varepsilon}{2\sqrt{c_{\sigma,\mathcal{H}}}}\Bigr)^{-2d/m} = c_{m,d}\, (4 c_{\sigma,\mathcal{H}})^{d/m}\, \varepsilon^{-2d/m} \le \tilde{c}_{m,d}\, \sigma^{d - d^2/(2m)}\, \varepsilon^{-2d/m}$

for all $m > d$ and new constants $\tilde{c}_{m,d}$ depending only on $m$ and $d$. Setting
$m := 2d/p$ completes the proof of Theorem 3.1. □

Proof of Theorem 2.1. As before we write := Hc,{X) and := 
Jh^-H„ C{X) in order to simplify notation. Furthermore recall for a 
training set T G {X x y)" the space L2{Tx) introduced in Section 2.2. Now 
let Rtx '■ C{X) — > L2{Tx) be the restriction map defined by / i— > f\Tx- Ob- 
viously, we have ||-RtxII ^ 1- Furthermore we define 1^ := ° J a so that 
/ct : — > L2(Tx) is the evaluation map. Then Theorem 3.1 and the product 
rule for covering numbers imply that 

(20)    $\sup_T \log \mathcal{N}(I_\sigma, \varepsilon) \le c_{q,d}\, \sigma^{(1-q/4)d}\, \varepsilon^{-q}$

for all $0 < q < 2$. To complete the proof of Theorem 2.1 we derive another
bound on the covering numbers and interpolate the two. To that end observe
that $I_\sigma : H_\sigma \to L_2(T_X)$ factors through $C(X)$ with both factors $J_\sigma$ and $R_{T_X}$
having norm not greater than 1. Hence Proposition 17.3.7 in [23] implies
that $I_\sigma$ is absolutely 2-summing with 2-summing norm not greater than
1. By König's theorem ([24], Lemma 2.7.2) we obtain for the approximation
numbers $(a_k(I_\sigma))$ of $I_\sigma$ that $\sum_{k \ge 1} a_k^2(I_\sigma) \le 1$ for all $\sigma > 0$. Since the
approximation numbers are decreasing it follows that $\sup_k \sqrt{k}\, a_k(I_\sigma) \le 1$.
Using Carl's inequality between approximation and entropy numbers (see
Theorem 3.1.1 in [10]) we thus find a constant $c > 0$ such that

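(21)    $\sup_T \log \mathcal{N}(I_\sigma, \varepsilon) \le c\, \varepsilon^{-2}$

for all $\varepsilon > 0$ and all $\sigma > 0$. Let us now interpolate the bound (21) with
the bound (20). Since $\|I_\sigma : H_\sigma \to L_2(T_X)\| \le 1$ we only need to consider
$0 < \varepsilon \le 1$. Let $0 < q < p < 2$ and $0 < a \le 1$. Then for $0 < \varepsilon < a$ we have

$\log \mathcal{N}(I_\sigma, \varepsilon) \le c_{q,d}\, \sigma^{(1-q/4)d}\, \varepsilon^{-q} \le c_{q,d}\, \sigma^{(1-q/4)d}\, a^{p-q}\, \varepsilon^{-p},$

and for $a \le \varepsilon \le 1$ we find

$\log \mathcal{N}(I_\sigma, \varepsilon) \le c\, \varepsilon^{-2} \le c\, a^{p-2}\, \varepsilon^{-p}.$

Since $\sigma \ge 1$ we can set $a := \sigma^{-((4-q)/(8-4q))d}$ and obtain

$\log \mathcal{N}(I_\sigma, \varepsilon) \le \tilde c_{q,d}\, \sigma^{(1-p/2)((8-2q)/(8-4q))d}\, \varepsilon^{-p},$

where $\tilde c_{q,d}$ is a constant depending only on $q$, $d$. The proof is completed by
choosing $q := \frac{4\delta}{1+2\delta}$ when $\delta < \frac{2p}{8-4p}$ and $q$ just smaller than $p$ otherwise. □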
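
The bookkeeping of exponents in this interpolation step can be checked with exact rational arithmetic. The following is a minimal sketch, not part of the original argument; the function names are illustrative. It assumes the two bounds have the forms $\sigma^{(1-q/4)d} a^{p-q}\varepsilon^{-p}$ and $a^{p-2}\varepsilon^{-p}$, balances them at $a = \sigma^{-((4-q)/(8-4q))d}$, and confirms that the choice $q = 4\delta/(1+2\delta)$ yields the exponent $(1-p/2)(1+\delta)d$ of Theorem 2.1.

```python
from fractions import Fraction

def balanced_sigma_exponent(p, q):
    """Exponent of sigma (per unit of d) after balancing the two covering bounds.

    The small-epsilon regime contributes sigma^{(1 - q/4) d} * a^{p - q},
    the large-epsilon regime contributes a^{p - 2}; both agree for
    a = sigma^{-s d} with s = (4 - q) / (8 - 4 q).
    """
    s = (4 - q) / (8 - 4 * q)
    from_small_eps = (1 - q / 4) - s * (p - q)
    from_large_eps = -s * (p - 2)
    assert from_small_eps == from_large_eps  # the two regimes are balanced
    return from_large_eps

def q_for_delta(delta):
    """The choice q = 4 delta / (1 + 2 delta) made at the end of the proof."""
    return 4 * delta / (1 + 2 * delta)

p = Fraction(3, 2)
delta = Fraction(1, 10)      # satisfies delta < 2p / (8 - 4p) = 3/2
q = q_for_delta(delta)
exponent = balanced_sigma_exponent(p, q)
```

For these values `q` equals 1/3, which is smaller than `p`, and `exponent` equals $(1 - p/2)(1+\delta) = 11/40$.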



4. Proofs of Theorems 2.7 and 2.6. In this section we prove Theorems 2.7 
and 2.6, which both deal with the geometric noise exponent. 

4.1. Proof of Theorem 2.7. Let us begin by recalling some facts about 
Gaussian RBF kernels. To this end let $H_\sigma(\mathbb{R}^d)$ be the RKHS of the Gaussian
RBF kernel with parameter $\sigma$. Then it was shown in [32] that the linear
operator $V_\sigma : L_2(\mathbb{R}^d) \to H_\sigma(\mathbb{R}^d)$ defined by

$V_\sigma g(x) = \frac{(2\sigma)^{d/2}}{\pi^{d/4}} \int_{\mathbb{R}^d} e^{-2\sigma^2 \|x-y\|_2^2}\, g(y)\, dy$

is an isometric isomorphism. Consequently, we obtain

(22)    $a_\sigma(\lambda) = \inf_{g \in L_2(\mathbb{R}^d)} \lambda \|g\|^2_{L_2(\mathbb{R}^d)} + \mathcal{R}_{l,P}(V_\sigma g) - \mathcal{R}_{l,P}, \qquad \lambda > 0.$

In the following we will estimate the right-hand side of (22) by a judicious
choice of $g$. To this end we need the following lemma, which in some sense
enlarges the support of $P$ to ensure that all balls of the form $B(x, \tau_x)$ are
contained in the (enlarged) support. This guarantee will then make it possible
to control the behavior of $V_\sigma g$ by tails of spherical Gaussian distributions
[see (28) for details].

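Lemma 4.1. Let $X$ be the closed unit ball of $\mathbb{R}^d$ and $P$ be a probability
measure on $X \times Y$ with regular conditional probability $\eta(x) = P(y = 1 | x)$,
$x \in X$. On $\acute{X} := 3X$ we define

(23)    $\acute\eta(x) := \begin{cases} \eta(x), & \text{if } |x| \le 1, \\ \eta(x/|x|), & \text{otherwise.} \end{cases}$

We also write $\acute{X}_{-1} := \{x \in \acute{X} : \acute\eta(x) < \frac12\}$ and $\acute{X}_1 := \{x \in \acute{X} : \acute\eta(x) > \frac12\}$.
Finally let $B(x,r)$ denote the open ball of radius $r$ about $x$ in $\mathbb{R}^d$. Then for
$x \in X_1$ we have $B(x, \tau_x) \subset \acute{X}_1$ and for $x \in X_{-1}$ we have $B(x, \tau_x) \subset \acute{X}_{-1}$.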

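Proof. Let $x \in X_1$ and $x' \in B(x, \tau_x)$. If $x' \in X$ we have $|x - x'| < \tau_x$,
which implies $\eta(x') > \frac12$ by the definition of $\tau_x$. This shows $x' \in \acute{X}_1$. Now
let us assume $|x'| > 1$. By $\langle x, x' \rangle \le |x'| < |x'|^2$ and Pythagoras' theorem we then
obtain

$\left| \frac{x'}{|x'|} - x \right|^2 \le \left| x' - \frac{\langle x, x' \rangle x'}{|x'|^2} \right|^2 + \left| \frac{\langle x, x' \rangle x'}{|x'|^2} - x \right|^2 = |x' - x|^2.$

Therefore, we have $\left| \frac{x'}{|x'|} - x \right| < \tau_x$, which implies $\acute\eta(x') = \eta\bigl(\frac{x'}{|x'|}\bigr) > \frac12$. □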

Let us finally recall that Zhang showed in [39] that the hinge risk satisfies

(24)    $\mathcal{R}_{l,P}(f) - \mathcal{R}_{l,P} = \mathbb{E}_{P_X}(|2\eta - 1| \cdot |f - f_P|)$

for all measurable $f : X \to [-1, 1]$. Now we are ready to prove Theorem 2.7.
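
Identity (24) is easy to verify on a finitely supported distribution. The sketch below is an illustration only; the three-point distribution and all function names are made up for this purpose. It computes the excess hinge risk of a function with values in $[-1,1]$ directly and compares it with $\mathbb{E}_{P_X}(|2\eta - 1| \cdot |f - f_P|)$, where $f_P = \mathrm{sign}(2\eta - 1)$.

```python
def hinge(y, t):
    """Hinge loss l(y, t) = max(0, 1 - y t)."""
    return max(0.0, 1.0 - y * t)

def sign(s):
    return (s > 0) - (s < 0)

def excess_hinge_risk(weights, eta, f):
    """R_{l,P}(f) - R_{l,P} for a finitely supported P_X.

    weights[i] = P_X({x_i}), eta[i] = P(y = 1 | x_i), f[i] = f(x_i) in [-1, 1].
    The minimal conditional hinge risk at x_i is 1 - |2 eta_i - 1|.
    """
    risk = sum(w * (e * hinge(1, v) + (1 - e) * hinge(-1, v))
               for w, e, v in zip(weights, eta, f))
    bayes = sum(w * (1 - abs(2 * e - 1)) for w, e in zip(weights, eta))
    return risk - bayes

def weighted_distance(weights, eta, f):
    """E_{P_X}(|2 eta - 1| * |f - f_P|) with f_P = sign(2 eta - 1)."""
    return sum(w * abs(2 * e - 1) * abs(v - sign(2 * e - 1))
               for w, e, v in zip(weights, eta, f))

weights = [0.2, 0.5, 0.3]
eta = [0.9, 0.5, 0.1]
f = [0.3, -0.7, 0.4]
lhs = excess_hinge_risk(weights, eta, f)
rhs = weighted_distance(weights, eta, f)
```

Both quantities agree up to rounding (here about 0.448); the point with $\eta = 1/2$ contributes nothing to either side.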

Proof of Theorem 2.7. With the notation of Lemma 4.1 we fix a
measurable $\acute{f}_P : \acute{X} \to [-1, 1]$ that satisfies $\acute{f}_P = 1$ on $\acute{X}_1$, $\acute{f}_P = -1$ on $\acute{X}_{-1}$
and $\acute{f}_P = 0$ otherwise. For $g := (\sigma^2/\pi)^{d/4} \acute{f}_P$ we then immediately obtain

(25) ii5|Il,(r^) < (^^j m. 

where $\theta(d)$ denotes the volume of $X$. Moreover, it is easy to see that
$-1 \le \acute{f}_P \le 1$ implies $-1 \le V_\sigma g \le 1$. Since $P_X$ has support in $X$, (24) then yields

(26)    $\mathcal{R}_{l,P}(V_\sigma g) - \mathcal{R}_{l,P} = \mathbb{E}_{P_X}(|2\eta - 1| \cdot |V_\sigma g - f_P|).$

In order to bound $|V_\sigma g(x) - f_P(x)|$ for $x \in X_1$ we observe

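$V_\sigma g(x) = \left(\frac{2\sigma^2}{\pi}\right)^{d/2} \int_{\mathbb{R}^d} e^{-2\sigma^2 \|x-y\|_2^2}\, \acute{f}_P(y)\, dy$

(27)    $= \left(\frac{2\sigma^2}{\pi}\right)^{d/2} \int_{\mathbb{R}^d} e^{-2\sigma^2 \|x-y\|_2^2}\, (\acute{f}_P(y) + 1)\, dy - 1$

$\ge \left(\frac{2\sigma^2}{\pi}\right)^{d/2} \int_{B(x,\tau_x)} e^{-2\sigma^2 \|x-y\|_2^2}\, (\acute{f}_P(y) + 1)\, dy - 1.$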

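Now remember that Lemma 4.1 showed $B(x, \tau_x) \subset \acute{X}_1$ for all $x \in X_1$, so that
(27) implies

(28)    $V_\sigma g(x) \ge 2 \left(\frac{2\sigma^2}{\pi}\right)^{d/2} \int_{B(x,\tau_x)} e^{-2\sigma^2 \|x-y\|_2^2}\, dy - 1 = 1 - 2 P_{\gamma_\sigma}(|u| \ge \tau_x),$

where $\gamma_\sigma = (2\sigma^2/\pi)^{d/2} e^{-2\sigma^2 |u|^2}$ is a spherical Gaussian in $\mathbb{R}^d$. According
to the tail bound [17], inequality (3.5) on page 59, we have
$P_{\gamma_\sigma}(|u| \ge r) \le 4 e^{-\sigma^2 r^2/(2d)}$ and consequently we obtain

$1 \ge V_\sigma g(x) \ge 1 - 8 e^{-\sigma^2 \tau_x^2/(2d)}, \qquad x \in X_1.$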
Since for $x \in X_{-1}$ we can obtain an analogous estimate, we conclude

$|V_\sigma g(x) - f_P(x)| \le 8 e^{-\sigma^2 \tau_x^2/(2d)}$

for all $x \in X_1 \cup X_{-1}$. Consequently (26) and the geometric noise assumption
for $t := 2d/\sigma^2$ yield

(29)    $\mathcal{R}_{l,P}(V_\sigma g) - \mathcal{R}_{l,P} \le 8\, \mathbb{E}_{x \sim P_X}\bigl(|2\eta(x) - 1|\, e^{-\sigma^2 \tau_x^2/(2d)}\bigr) \le 8C (2d)^{\alpha d/2} \sigma^{-\alpha d},$

where $C$ is the constant in (8). Combining (29), (25) and (22) now yields
the assertion. □
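
The Gaussian tail estimate behind (28) and (29) can be sanity-checked in dimension $d = 1$, where the tail has a closed form. The sketch below is illustrative only. It assumes the density $\gamma_\sigma(u) = (2\sigma^2/\pi)^{1/2} e^{-2\sigma^2 u^2}$, that is, a centered normal law with variance $1/(4\sigma^2)$, and the tail bound in the form $P_{\gamma_\sigma}(|u| \ge r) \le 4 e^{-\sigma^2 r^2/(2d)}$; the function names are not from the paper.

```python
import math

def gaussian_tail_1d(sigma, r):
    """P(|u| >= r) for the density sqrt(2 sigma^2 / pi) * exp(-2 sigma^2 u^2).

    This is N(0, 1/(4 sigma^2)), hence the tail equals erfc(sqrt(2) * sigma * r).
    """
    return math.erfc(math.sqrt(2.0) * sigma * r)

def tail_bound(sigma, r, d=1):
    """Right-hand side 4 exp(-sigma^2 r^2 / (2 d)) of the assumed tail bound."""
    return 4.0 * math.exp(-sigma ** 2 * r ** 2 / (2.0 * d))

def smoothed_label_lower_bound(sigma, tau):
    """1 - 2 P(|u| >= tau): lower bound on V_sigma g(x) at distance tau from the boundary."""
    return 1.0 - 2.0 * gaussian_tail_1d(sigma, tau)

checks = [(sigma, r, gaussian_tail_1d(sigma, r), tail_bound(sigma, r))
          for sigma in (1.0, 2.0, 5.0) for r in (0.0, 0.1, 0.5, 1.0)]
```

In one dimension the exact tail is far below the bound; the bound is stated for general $d$ and only its exponential form in $\sigma^2 \tau_x^2 / d$ matters for (29).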

4.2. Proof of Theorem 2.6. In this subsection, all Lebesgue and Lorentz
spaces (see, e.g., [5]) and their norms are with respect to the measure $P_X$.

Proof of Theorem 2.6. Let us first consider the case $q \ge 1$ where we
can apply the Hölder inequality for Lorentz spaces [22], which states

$\|fg\|_1 \le \|f\|_{q,\infty} \|g\|_{q',1}$

for all $f \in L_{q,\infty}$, $g \in L_{q',1}$ and $q'$ defined by $\frac1q + \frac1{q'} = 1$. Applying this
inequality gives

(30)    $\mathbb{E}_{x \sim P_X}\bigl(|2\eta(x) - 1|\, e^{-\tau_x^2/t}\bigr) \le \|(2\eta - 1)^{-1}\|_{q,\infty}\, \bigl\| x \mapsto (2\eta(x) - 1)^2 e^{-\tau_x^2/t} \bigr\|_{q',1}$
        $\le C\, \bigl\| (2\eta - 1)^2 e^{-(|2\eta - 1|/c_\gamma)^{2/\gamma} t^{-1}} \bigr\|_{q',1},$

where in the last estimate we used the Tsybakov assumption (5) and the
fact that $P$ has an envelope of order $\gamma$. Let us write $h(x) := |2\eta(x) - 1|^{-1}$,
$x \in X$, and $b := t (c_\gamma)^{2/\gamma}$ so that

$|2\eta(x) - 1|^2\, e^{-(|2\eta(x) - 1|/c_\gamma)^{2/\gamma} t^{-1}} = g(h(x)),$

where $g(s) := s^{-2} e^{-s^{-2/\gamma} b^{-1}}$ for $s \ge 1$. Now it is easy to see that $g$ :

[1,00) — > [0,00) is strictly increasing if < 6 < and hence we can extend 
5 to a strictly increasing, continuous and invertible function on [0, 00) in this 
case. Let such an extension also be denoted by g. Then for this extension 
we have 

(31)    $P_X(g \circ h > \tau) = P_X(h > g^{-1}(\tau)).$

Now for a function $f : X \to [0, \infty)$ recall the nonincreasing rearrangement

$f^*(u) := \inf\{\sigma \ge 0 : P_X(f > \sigma) \le u\}, \qquad u > 0,$

of $f$ which can be used to define Lorentz norms (see, e.g., [5]). For $u > 0$
equation (31) then yields

$(g \circ h)^*(u) = g\bigl(\inf\{g^{-1}(\sigma) : P_X(h > g^{-1}(\sigma)) \le u\}\bigr) = g \circ h^*(u).$
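
The commutation rule $(g \circ h)^* = g \circ h^*$ for increasing $g$ can be illustrated on a finitely supported measure. The sketch below is an illustration under stated assumptions: it uses the definition $f^*(u) = \inf\{\sigma \ge 0 : P_X(f > \sigma) \le u\}$, a made-up three-point measure, and the strictly increasing map $g(s) = s^2$ with $g(0) = 0$; the function name `rearrangement` is hypothetical.

```python
from fractions import Fraction

def rearrangement(values, weights, u):
    """f*(u) = inf{s >= 0 : P(f > s) <= u} for a finitely supported measure.

    P(f > s) is a right-continuous step function of s, so the infimum is
    attained at 0 or at one of the values of f.
    """
    candidates = sorted({Fraction(0)} | set(values))
    for s in candidates:
        mass_above = sum(w for v, w in zip(values, weights) if v > s)
        if mass_above <= u:
            return s
    return candidates[-1]  # never reached: the largest value has no mass above it

def g(s):
    """A strictly increasing map on [0, infinity) with g(0) = 0."""
    return s * s

h_values = [Fraction(3), Fraction(1), Fraction(2)]
weights = [Fraction(1, 5), Fraction(1, 2), Fraction(3, 10)]
levels = [Fraction(1, 10), Fraction(1, 5), Fraction(1, 2), Fraction(1)]

h_star = [rearrangement(h_values, weights, u) for u in levels]
gh_star = [rearrangement([g(v) for v in h_values], weights, u) for u in levels]
```

Here `h_star` is the decreasing sequence 3, 2, 1, 0 and `gh_star` is its image under `g`, in line with the identity used in the proof.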
Now, inequality (5) implies $P_X\bigl(h \ge (C/u)^{1/q}\bigr) \le u$ for all $u > 0$. Therefore, we
find

$h^*(u) \le \inf\{\sigma \ge 0 : P_X(h \ge \sigma) \le u\} \le \left(\frac{C}{u}\right)^{1/q}$

for all $0 < u < 1$. Since $(g \circ h)^* = g \circ h^*$ and $g$ is increasing we hence have

$(g \circ h)^*(u) \le g\Bigl(\left(\frac{C}{u}\right)^{1/q}\Bigr)$

for all < u < 1. Now, for fixed a > the bound e ^ ^ , j^, " on (0, cxd) 

' — In^ ^ ' 

implies 

g2(a/7-l) 

for $s \in [1, \infty)$. Using the fact that $(g \circ h)^*(u) = 0$ holds for all $u \ge 1$, we hence
obtain

^2/g(l-«/7) 

^^°'^*^"^-^" ln^((VC)^/(^^)5-^) + l 

for $u > 0$ if we assume without loss of generality that $C \ge 1$. Let us define
a := 7^Y^- Then we find Jt- + |(1 — ^) = and consequently for 6 < that 

* - 37(c^)2/7 ' we obtain 

/•oo 

ll!f°'>ll,',l= / t."'-'(9°'!)*(!')<i« 
(32) ^° 

- Jo ln2((n/C7)2/(97)6-i) + l " 

by the definition of $b$. Since we also have $\mathbb{E}_{x \sim P_X}\bigl(|2\eta(x) - 1|\, e^{-\tau_x^2/t}\bigr) \le 1$ for all
$t > 0$, estimate (30) together with the definition of $g$ and (32) yields the assertion
in the case $q \ge 1$.

Let us now consider the case $0 \le q < 1$ where the Hölder inequality in
Lorentz space cannot be used. Then for all $t, \tau > 0$ we have

(33)    $\mathbb{E}_{x \sim P_X}\bigl(|2\eta(x) - 1|\, e^{-\tau_x^2/t}\bigr) = \int_{|2\eta - 1| \le \tau} |2\eta(x) - 1|\, e^{-\tau_x^2/t}\, P_X(dx) + \int_{|2\eta - 1| > \tau} |2\eta(x) - 1|\, e^{-\tau_x^2/t}\, P_X(dx)$
        $\le C \tau^{q+1} + \exp\bigl(-(\tau/c_\gamma)^{2/\gamma} t^{-1}\bigr),$

where we have used the Tsybakov assumption (5) and the fact that $P$ has
an envelope of order $\gamma$. Let us define $\tau$ by $\tau^{q+1} := \exp\bigl(-(\tau/c_\gamma)^{2/\gamma} t^{-1}\bigr)$. For
$a := (c_\gamma)^{2/\gamma}(q + 1)$ and small $t$ this definition implies

and hence the assertion follows from (33) for the case $0 \le q < 1$. □

5. The estimation error of ERM-type classifiers. To bound the estima- 
tion error in the proof of Theorem 2.8 we now establish a concentration 
inequality for ERM-type algorithms using a variant of Talagrand's con- 
centration inequality together with local Rademacher averages (see, e.g., 
[2, 4, 21]). Our approach is inspired by [3]. However, due to the regularization
term $\lambda \|f\|_H^2$ in the definition of SVMs we need a more general result
than that of [3].

This section is organized as follows: In Section 5.1 we present the required 
modification of the result of [3]. Then in Section 5.2 we bound the resulting 
local Rademacher averages. 

5.1. Bounding the estimation error for ERM-type algorithms. We first
have to introduce some notation. To this end let $\mathcal{F}$ be a class of bounded
measurable functions from $Z$ to $\mathbb{R}$ such that $\mathcal{F}$ is separable with respect
to $\|\cdot\|_\infty$. Given a probability measure $P$ on $Z$ we define the modulus of
continuity of $\mathcal{F}$ by

$\omega_n(\mathcal{F}, \varepsilon) := \omega_{P,n}(\mathcal{F}, \varepsilon) := \mathbb{E}_{T \sim P^n}\Bigl( \sup_{f \in \mathcal{F},\ \mathbb{E}_P f^2 \le \varepsilon} |\mathbb{E}_P f - \mathbb{E}_T f| \Bigr), \qquad \varepsilon > 0,$

where we note that the supremum is, as a function from $Z^n$ to $\mathbb{R}$, measurable
by the separability assumption on $\mathcal{F}$. Now, a function $L : \mathcal{F} \times Z \to [0, \infty)$ is
called a loss function if $L \circ f := L(f, \cdot)$ is measurable for all $f \in \mathcal{F}$. Given a
probability measure $P$ on $Z$ we indicate by $f_{P,\mathcal{F}} \in \mathcal{F}$ a minimizer of

$f \mapsto \mathcal{R}_{L,P}(f) := \mathbb{E}_{z \sim P} L(f, z).$

Throughout this paper $\mathcal{R}_{L,P}(f)$ is called the $L$-risk of $f$. If $P$ is an empirical
measure with respect to $T \in Z^n$ we write $f_{T,\mathcal{F}}$ and $\mathcal{R}_{L,T}(\cdot)$ as usual.
For simplicity, we assume throughout this section that $f_{P,\mathcal{F}}$ and $f_{T,\mathcal{F}}$ do exist.
Furthermore, although there may be multiple solutions we use a single
symbol for them whenever no confusion regarding the nonuniqueness of this
symbol can be expected. An algorithm that produces solutions $f_{T,\mathcal{F}}$ is called
an empirical $L$-risk minimizer. Moreover, if $\mathcal{F}$ is convex, we say that $L$ is
convex if $L(\cdot, z)$ is convex for all $z \in Z$. Finally, $L$ is called line-continuous
if for all $z \in Z$ and all $f, \acute{f} \in \mathcal{F}$ the function $t \mapsto L(tf + (1 - t)\acute{f}, z)$ is continuous
on $[0, 1]$. If $\mathcal{F}$ is a vector space then every convex $L$ is line-continuous.
Now the main result of this section reads as follows:
Now the main result of this section reads as follows: 

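Theorem 5.1. Let $\mathcal{F}$ be a convex set of bounded measurable functions
from $Z$ to $\mathbb{R}$, and let $L : \mathcal{F} \times Z \to [0, \infty)$ be a convex and line-continuous loss
function. For a probability measure $P$ on $Z$ we define

$\mathcal{G} := \{L \circ f - L \circ f_{P,\mathcal{F}} : f \in \mathcal{F}\}.$

Suppose that there are constants $c \ge 0$, $0 < \alpha \le 1$, $\delta \ge 0$ and $B > 0$ with
$\mathbb{E}_P g^2 \le c (\mathbb{E}_P g)^\alpha + \delta$ and $\|g\|_\infty \le B$ for all $g \in \mathcal{G}$. Furthermore, assume that
$\mathcal{G}$ is separable with respect to $\|\cdot\|_\infty$. Let $n \ge 1$, $x \ge 1$ and $\varepsilon > 0$ with

(34)    $\varepsilon \ge 10 \max\Bigl\{ \omega_n(\mathcal{G}, c\varepsilon^\alpha + \delta),\ \sqrt{\frac{\delta x}{n}},\ \left(\frac{4cx}{n}\right)^{1/(2-\alpha)},\ \frac{Bx}{n} \Bigr\}.$

Then we have

$\Pr{}^*\bigl(T \in Z^n : \mathcal{R}_{L,P}(f_{T,\mathcal{F}}) < \mathcal{R}_{L,P}(f_{P,\mathcal{F}}) + \varepsilon\bigr) \ge 1 - e^{-x}.$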

Remark 5.2. Theorem 5.1 has been proved in [3] for $\delta = 0$, where it
was used to find learning rates faster than $n^{-1/2}$ for certain ERM-type
algorithms. At first glance such fast rates are impossible if $\delta > 0$. However, we
will see later that for SVMs we have $\delta = a^\kappa(\lambda)$ for a suitable $\kappa > 0$ depending
on both Tsybakov's and the geometric noise exponent, and hence we have
$\delta \to 0$ for $n \to \infty$.

As already mentioned, the proof of Theorem 5.1 is based on Talagrand's 
concentration inequality in [33] and its refinements in [16, 20, 25]. The ver- 
sion below of this inequality is derived from Bousquet's result in [9] using a 
little trick presented in [2], Lemma 2.5. 

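Theorem 5.3. Let $P$ be a probability measure on $Z$ and $\mathcal{H}$ be a set of
bounded measurable functions from $Z$ to $\mathbb{R}$ which is separable with respect to
$\|\cdot\|_\infty$ and satisfies $\mathbb{E}_P h = 0$ for all $h \in \mathcal{H}$. Furthermore, let $b > 0$ and $\tau \ge 0$
be constants with $\|h\|_\infty \le b$ and $\mathbb{E}_P h^2 \le \tau$ for all $h \in \mathcal{H}$. Then for all $x \ge 1$
and all $n \ge 1$ we have

$P^n\Bigl( T \in Z^n : \sup_{h \in \mathcal{H}} \mathbb{E}_T h > 3\, \mathbb{E}_{T' \sim P^n} \sup_{h \in \mathcal{H}} \mathbb{E}_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n} \Bigr) \le e^{-x}.$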

This concentration inequality is used to prove the following lemma which 
is a generalized version of Lemma 13 in [3]. 



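Lemma 5.4. Let $P$ be a probability measure on $Z$ and $\mathcal{G}$ be a set of
bounded measurable functions from $Z$ to $\mathbb{R}$ which is separable with respect
to $\|\cdot\|_\infty$. Let $c \ge 0$, $0 < \alpha \le 1$, $\delta \ge 0$ and $B > 0$ be constants with
$\mathbb{E}_P g^2 \le c (\mathbb{E}_P g)^\alpha + \delta$ and $\|g\|_\infty \le B$ for all $g \in \mathcal{G}$. Furthermore, assume that for all
$T \in Z^n$ and all $\varepsilon > 0$ for which for some $g \in \mathcal{G}$ we have

$\mathbb{E}_T g \le \varepsilon/20$ and $\mathbb{E}_P g \ge \varepsilon$

there exists a $g^* \in \mathcal{G}$ which satisfies

$\mathbb{E}_T g^* \le \varepsilon/20$ and $\mathbb{E}_P g^* = \varepsilon.$

Then for all $n \ge 1$, $x \ge 1$, and all $\varepsilon > 0$ satisfying (34), we have

$\Pr{}^*\bigl(T \in Z^n : \text{for all } g \in \mathcal{G} \text{ with } \mathbb{E}_T g \le \varepsilon/20 \text{ we have } \mathbb{E}_P g < \varepsilon\bigr) \ge 1 - e^{-x}.$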

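Proof. We define $\mathcal{H} := \{\mathbb{E}_P g - g : g \in \mathcal{G},\ \mathbb{E}_P g = \varepsilon\}$. Obviously, we have
$\mathbb{E}_P h = 0$, $\|h\|_\infty \le 2B$, and $\mathbb{E}_P h^2 = \mathbb{E}_P g^2 - (\mathbb{E}_P g)^2 \le c\varepsilon^\alpha + \delta$ for all $h \in \mathcal{H}$.
Moreover, since it is also easy to verify that $\mathcal{H}$ is separable with respect to
$\|\cdot\|_\infty$, our assumption on $\mathcal{G}$ yields

$\Pr{}^*(T \in Z^n : \exists g \in \mathcal{G} \text{ with } \mathbb{E}_T g \le \varepsilon/20 \text{ and } \mathbb{E}_P g \ge \varepsilon)$
    $\le \Pr{}^*(T \in Z^n : \exists g \in \mathcal{G} \text{ with } \mathbb{E}_P g - \mathbb{E}_T g \ge 19\varepsilon/20 \text{ and } \mathbb{E}_P g = \varepsilon)$
    $\le P^n\Bigl(T \in Z^n : \sup_{h \in \mathcal{H}} \mathbb{E}_T h \ge 19\varepsilon/20\Bigr).$

Note that since $\mathcal{H}$ is separable with respect to $\|\cdot\|_\infty$, the set on the last line
is actually measurable. In order to bound the last probability we will apply
Theorem 5.3. To this end we have to show

$\frac{19\varepsilon}{20} > 3\, \mathbb{E}_{T' \sim P^n} \sup_{h \in \mathcal{H}} \mathbb{E}_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n}.$

Our assumptions on $\varepsilon$ imply

(35)    $\varepsilon \ge 10\, \mathbb{E}_{T' \sim P^n}\Bigl( \sup_{g \in \mathcal{G},\ \mathbb{E}_P g^2 \le c\varepsilon^\alpha + \delta} |\mathbb{E}_P g - \mathbb{E}_{T'} g| \Bigr) \ge 10\, \mathbb{E}_{T' \sim P^n} \sup_{h \in \mathcal{H}} \mathbb{E}_{T'} h.$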



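Furthermore, since $10 \ge (\frac{60}{19})^2$ and $0 < \alpha \le 1$ we have

$\varepsilon \ge 10 \left(\frac{4cx}{n}\right)^{1/(2-\alpha)} \ge \left(\frac{60}{19}\right)^{2/(2-\alpha)} \left(\frac{4cx}{n}\right)^{1/(2-\alpha)}.$

If $\delta \le c\varepsilon^\alpha$ a simple calculation hence shows $\varepsilon \ge \frac{60}{19} \sqrt{\frac{2(c\varepsilon^\alpha + \delta)x}{n}}$. Furthermore,
if $\delta > c\varepsilon^\alpha$ the assumptions of the theorem show

$\varepsilon \ge 10 \sqrt{\frac{\delta x}{n}} \ge \frac{60}{19} \sqrt{\frac{4\delta x}{n}} \ge \frac{60}{19} \sqrt{\frac{2(c\varepsilon^\alpha + \delta)x}{n}}.$

Hence we have $\varepsilon \ge \frac{60}{19} \sqrt{\frac{2(c\varepsilon^\alpha + \delta)x}{n}}$ for all $\varepsilon$ satisfying the assumptions of the
theorem. Now let $\tau := c\varepsilon^\alpha + \delta$ and $b := 2B$. By (35) and $\varepsilon \ge 10 Bx/n$ we then
find

$\frac{19\varepsilon}{20} > 3\, \mathbb{E}_{T' \sim P^n} \sup_{h \in \mathcal{H}} \mathbb{E}_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n}.$

Applying Theorem 5.3 then yields

$\Pr{}^*(T \in Z^n : \exists g \in \mathcal{G} \text{ with } \mathbb{E}_T g \le \varepsilon/20 \text{ and } \mathbb{E}_P g \ge \varepsilon)$
    $\le P^n\Bigl(T \in Z^n : \sup_{h \in \mathcal{H}} \mathbb{E}_T h \ge 19\varepsilon/20\Bigr)$
    $\le P^n\Bigl(T \in Z^n : \sup_{h \in \mathcal{H}} \mathbb{E}_T h > 3\, \mathbb{E}_{T' \sim P^n} \sup_{h \in \mathcal{H}} \mathbb{E}_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n}\Bigr)$
    $\le e^{-x}.$ □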

With the help of the above lemma we can now prove the main result of 
this section, that is, Theorem 5.1. 



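Proof of Theorem 5.1. In order to apply Lemma 5.4 to the class $\mathcal{G}$
it obviously suffices to show the richness condition on $\mathcal{G}$ of Lemma 5.4. To
this end let $f \in \mathcal{F}$ with

$\mathbb{E}_T(L \circ f - L \circ f_{P,\mathcal{F}}) \le \varepsilon/20$ and $\mathbb{E}_P(L \circ f - L \circ f_{P,\mathcal{F}}) \ge \varepsilon.$

For $t \in [0, 1]$ we define $f_t := tf + (1 - t) f_{P,\mathcal{F}}$. Since $\mathcal{F}$ is convex we have
$f_t \in \mathcal{F}$ for all $t \in [0, 1]$. By the line-continuity of $L$ and Lebesgue's theorem
we find that the map $h : t \mapsto \mathbb{E}_P(L \circ f_t - L \circ f_{P,\mathcal{F}})$ which maps from $[0, 1]$ to
$[0, B]$ is continuous. Since $h(0) = 0$ and $h(1) \ge \varepsilon$ there is a $t \in (0, 1]$ with

$\mathbb{E}_P(L \circ f_t - L \circ f_{P,\mathcal{F}}) = h(t) = \varepsilon$

by the intermediate value theorem. Moreover, for this $t$ we have

$\mathbb{E}_T(L \circ f_t - L \circ f_{P,\mathcal{F}}) \le \mathbb{E}_T(tL \circ f + (1 - t) L \circ f_{P,\mathcal{F}} - L \circ f_{P,\mathcal{F}}) \le \varepsilon/20.$

Now, let $\varepsilon > 0$ with $\varepsilon \ge 10 \max\{\omega_n(\mathcal{G}, c\varepsilon^\alpha + \delta),\ (\frac{\delta x}{n})^{1/2},\ (\frac{4cx}{n})^{1/(2-\alpha)},\ \frac{Bx}{n}\}$.
Then by Lemma 5.4 we find that with probability at least $1 - e^{-x}$, every
$f \in \mathcal{F}$ with $\mathbb{E}_T(L \circ f - L \circ f_{P,\mathcal{F}}) \le \varepsilon/20$ satisfies $\mathbb{E}_P(L \circ f - L \circ f_{P,\mathcal{F}}) < \varepsilon$.
Since we always have

$\mathbb{E}_T(L \circ f_{T,\mathcal{F}} - L \circ f_{P,\mathcal{F}}) \le 0 < \varepsilon/20$

we obtain the assertion. □

5.2. Bounding the modulus of continuity. The aim of this subsection is
to bound the modulus of continuity of the class $\mathcal{G}$ in Theorem 5.1 with
the help of covering numbers. We then present the resulting modification of
Theorem 5.1.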

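Let us begin by recalling the definition of (local) Rademacher averages.
To this end let $\mathcal{F}$ be a class of bounded measurable functions from $Z$ to $\mathbb{R}$
which is separable with respect to $\|\cdot\|_\infty$. Furthermore, let $P$ be a probability
measure on $Z$ and $(\varepsilon_i)$ be a sequence of i.i.d. Rademacher variables (i.e.,
symmetric $\{-1, 1\}$-valued random variables) with respect to some probability
measure $\mu$ on a set $\Omega$. Then the Rademacher average of $\mathcal{F}$ is

$\mathrm{Rad}_P(\mathcal{F}, n) := \mathrm{Rad}(\mathcal{F}, n) := \mathbb{E}_{P^n} \mathbb{E}_\mu \sup_{f \in \mathcal{F}} \Bigl| \frac1n \sum_{i=1}^n \varepsilon_i f(z_i) \Bigr|$

and for $\varepsilon > 0$ the local Rademacher average of $\mathcal{F}$ is defined by

$\mathrm{Rad}(\mathcal{F}, n, \varepsilon) := \mathrm{Rad}_P(\mathcal{F}, n, \varepsilon) := \mathbb{E}_{P^n} \mathbb{E}_\mu \sup_{f \in \mathcal{F},\ \mathbb{E}_P f^2 \le \varepsilon} \Bigl| \frac1n \sum_{i=1}^n \varepsilon_i f(z_i) \Bigr|.$

For a given $a > 0$ we immediately obtain $\mathrm{Rad}(a\mathcal{F}, n) = a\, \mathrm{Rad}(\mathcal{F}, n)$ and

(37)    $\mathrm{Rad}(a\mathcal{F}, n, \varepsilon) = a\, \mathrm{Rad}(\mathcal{F}, n, a^{-2}\varepsilon).$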
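
The scaling rule (37) can be illustrated on a finite class and a fixed sample, where the expectation over the Rademacher signs can be computed exactly by enumeration. The sketch below makes two simplifying assumptions that are not in the text: it conditions on one sample instead of averaging over $P^n$, and it uses the empirical second moment in place of $\mathbb{E}_P f^2$. The function name and the numbers are illustrative.

```python
from itertools import product

def local_rademacher(values, radius=float("inf")):
    """Exact empirical local Rademacher average on a fixed sample.

    values[k][i] is f_k(z_i). Only functions whose empirical second moment
    is at most `radius` enter the supremum; an empty supremum counts as 0.
    """
    n = len(values[0])
    admissible = [row for row in values if sum(v * v for v in row) / n <= radius]
    if not admissible:
        return 0.0
    total = 0.0
    for signs in product((-1, 1), repeat=n):  # all 2^n sign patterns
        total += max(abs(sum(s * v for s, v in zip(signs, row))) / n
                     for row in admissible)
    return total / 2 ** n

values = [[0.5, -0.2, 0.1], [0.9, 0.4, -0.3], [-0.1, 0.0, 0.2]]
a = 2.0
scaled = [[a * v for v in row] for row in values]

lhs = local_rademacher(scaled, 0.8)                 # Rad(aF, n, eps) with eps = 0.8
rhs = a * local_rademacher(values, 0.8 / a ** 2)    # a * Rad(F, n, eps / a^2)
```

Scaling the class by `a` multiplies every second moment by `a ** 2`, so the same functions are admissible on both sides and the two values coincide.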

Moreover, by symmetrization the modulus of continuity can be estimated
by the local Rademacher average. More precisely, we always have (see [36])

$\omega_{P,n}(\mathcal{F}, \varepsilon) \le 2\, \mathrm{Rad}_P(\mathcal{F}, n, \varepsilon), \qquad \varepsilon > 0.$

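Local Rademacher averages can be estimated by covering numbers. Without
proof we state a slight modification of a corresponding result in [21]:

Proposition 5.5. Let $\mathcal{F}$ be a class of measurable functions from $Z$ to
$[-1, 1]$ which is separable with respect to $\|\cdot\|_\infty$ and let $P$ be a probability
measure on $Z$. Assume there are constants $a > 0$ and $0 < p < 2$ with

$\sup_{T \in Z^n} \log \mathcal{N}(\mathcal{F}, \varepsilon, L_2(T)) \le a \varepsilon^{-p}$

for all $\varepsilon > 0$. Then there exists a constant $c_p > 0$ depending only on $p$ such
that for all $n \ge 1$ and all $\varepsilon > 0$ we have

$\mathrm{Rad}(\mathcal{F}, n, \varepsilon) \le c_p \max\Bigl\{ \varepsilon^{1/2 - p/4} \left(\frac{a}{n}\right)^{1/2},\ \left(\frac{a}{n}\right)^{2/(2+p)} \Bigr\}.$

Using this proposition we can replace the modulus of continuity in Theorem 5.1
by an assumption on the covering numbers of $\mathcal{G}$. Assuming that all
resulting minimizers exist, the corresponding result then reads as follows:

Theorem 5.6. Let $\mathcal{F}$ be a convex set of bounded measurable functions
from $Z$ to $\mathbb{R}$ and let $L : \mathcal{F} \times Z \to [0, \infty)$ be a convex and line-continuous loss
function. For a probability measure $P$ on $Z$ we define

$\mathcal{G} := \{L \circ f - L \circ f_{P,\mathcal{F}} : f \in \mathcal{F}\}.$

Suppose that there are constants $c \ge 0$, $0 < \alpha \le 1$, $\delta \ge 0$ and $B > 0$ with
$\mathbb{E}_P g^2 \le c (\mathbb{E}_P g)^\alpha + \delta$ and $\|g\|_\infty \le B$ for all $g \in \mathcal{G}$. Furthermore, assume that
$\mathcal{G}$ is separable with respect to $\|\cdot\|_\infty$ and that there are constants $a \ge 1$ and
$0 < p < 2$ with

(38)    $\sup_{T \in Z^n} \log \mathcal{N}(B^{-1}\mathcal{G}, \varepsilon, L_2(T)) \le a \varepsilon^{-p}$

for all $\varepsilon > 0$. Then there exists a constant $c_p > 0$ depending only on $p$ such
that for all $n \ge 1$ and all $x \ge 1$ we have

$\Pr{}^*\bigl(T \in Z^n : \mathcal{R}_{L,P}(f_{T,\mathcal{F}}) > \mathcal{R}_{L,P}(f_{P,\mathcal{F}}) + c_p\, \varepsilon(n, a, B, c, \delta, x)\bigr) \le e^{-x},$

where

$\varepsilon(n, a, B, c, \delta, x) := B^{2p/(4-2\alpha+\alpha p)}\, c^{(2-p)/(4-2\alpha+\alpha p)} \left(\frac{a}{n}\right)^{2/(4-2\alpha+\alpha p)} + B^{p/2}\, \delta^{(2-p)/4} \left(\frac{a}{n}\right)^{1/2}$
        $+ B \left(\frac{a}{n}\right)^{2/(2+p)} + \sqrt{\frac{\delta x}{n}} + \left(\frac{cx}{n}\right)^{1/(2-\alpha)} + \frac{Bx}{n}.$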
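
To get a feeling for the size of the $\varepsilon$-term, it can be evaluated numerically. The sketch below is a hedged illustration: it assumes that the term is the six-summand expression suggested by the proof that follows (two terms involving the covering constant $a$ and the variance constants $c$, $\delta$, one pure covering term, and three terms in $x/n$). The function name and argument order are hypothetical.

```python
def epsilon_term(n, a, B, c, delta, x, alpha, p):
    """Evaluate the assumed six-term form of epsilon(n, a, B, c, delta, x).

    alpha in (0, 1] is the variance exponent and p in (0, 2) the covering
    exponent; both enter through k = 4 - 2*alpha + alpha*p.
    """
    k = 4 - 2 * alpha + alpha * p
    ratio = a / n
    return (B ** (2 * p / k) * c ** ((2 - p) / k) * ratio ** (2 / k)
            + B ** (p / 2) * delta ** ((2 - p) / 4) * ratio ** 0.5
            + B * ratio ** (2 / (2 + p))
            + (delta * x / n) ** 0.5
            + (c * x / n) ** (1 / (2 - alpha))
            + B * x / n)

# With delta = 0 and alpha = 1 the leading terms decay like n^{-2/(2+p)},
# which is faster than n^{-1/2} for every p < 2.
small_sample = epsilon_term(10 ** 3, 1.0, 1.0, 1.0, 0.0, 1.0, alpha=1.0, p=1.0)
large_sample = epsilon_term(10 ** 6, 1.0, 1.0, 1.0, 0.0, 1.0, alpha=1.0, p=1.0)
```

A positive `delta` adds terms of order $n^{-1/2}$, which is why the later sections need $\delta \to 0$ to retain fast rates.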

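Proof. By (37) and Proposition 5.5 we find

$\mathrm{Rad}(\mathcal{G}, n, \varepsilon) \le c_p \max\Bigl\{ B^{p/2} \varepsilon^{1/2 - p/4} \left(\frac{a}{n}\right)^{1/2},\ B \left(\frac{a}{n}\right)^{2/(2+p)} \Bigr\}.$

We assume without loss of generality that $c_p \ge 5$. Let $\varepsilon^* > 0$ be the largest
real number that satisfies

(39)    $\varepsilon^* = 2 c_p B^{p/2} \bigl(c (\varepsilon^*)^\alpha + \delta\bigr)^{(2-p)/4} \left(\frac{a}{n}\right)^{1/2}.$

Furthermore, let $\varepsilon > 0$ be such that

$\varepsilon = 2 c_p \max\Bigl\{ B^{p/2} (c\varepsilon^\alpha + \delta)^{(2-p)/4} \left(\frac{a}{n}\right)^{1/2},\ B \left(\frac{a}{n}\right)^{2/(2+p)},\ \sqrt{\frac{\delta x}{n}},\ \left(\frac{4cx}{n}\right)^{1/(2-\alpha)},\ \frac{Bx}{n} \Bigr\}.$

It is easy to see that both $\varepsilon$ and $\varepsilon^*$ exist. Moreover, our above considerations
show $\varepsilon \ge 10 \max\{\omega_n(\mathcal{G}, c\varepsilon^\alpha + \delta),\ (\frac{\delta x}{n})^{1/2},\ (\frac{4cx}{n})^{1/(2-\alpha)},\ \frac{Bx}{n}\}$, that is, $\varepsilon$ satisfies
the assumptions of Theorem 5.1. In order to show the assertion it therefore
suffices to bound $\varepsilon$ from above. To this end let us first assume that

$B^{p/2} (c\varepsilon^\alpha + \delta)^{(2-p)/4} \left(\frac{a}{n}\right)^{1/2} \ge \max\Bigl\{ B \left(\frac{a}{n}\right)^{2/(2+p)},\ \sqrt{\frac{\delta x}{n}},\ \left(\frac{4cx}{n}\right)^{1/(2-\alpha)},\ \frac{Bx}{n} \Bigr\}.$

Then we have $\varepsilon = 2 c_p B^{p/2} (c\varepsilon^\alpha + \delta)^{(2-p)/4} (\frac{a}{n})^{1/2}$. Since $\varepsilon^*$ is the largest
solution of this equation we hence find $\varepsilon \le \varepsilon^*$. This shows that we always
have

$\varepsilon \le \varepsilon^* + 2 c_p \Bigl( B \left(\frac{a}{n}\right)^{2/(2+p)} + \sqrt{\frac{\delta x}{n}} + \left(\frac{4cx}{n}\right)^{1/(2-\alpha)} + \frac{Bx}{n} \Bigr).$

Hence it suffices to bound $\varepsilon^*$ from above. To this end let us first assume
$c (\varepsilon^*)^\alpha \ge \delta$. This implies $\varepsilon^* \le 4 c_p B^{p/2} \bigl(c (\varepsilon^*)^\alpha\bigr)^{1/2 - p/4} (\frac{a}{n})^{1/2}$, and hence we
find

$\varepsilon^* \le 16 c_p^2\, B^{2p/(4-2\alpha+\alpha p)}\, c^{(2-p)/(4-2\alpha+\alpha p)} \left(\frac{a}{n}\right)^{2/(4-2\alpha+\alpha p)}.$

Conversely, if $c (\varepsilon^*)^\alpha < \delta$ holds, then we immediately obtain

$\varepsilon^* < 4 c_p B^{p/2} \delta^{(2-p)/4} \left(\frac{a}{n}\right)^{1/2}.$ □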

6. Variance bounds for SVMs. In this section we prove some "variance
bounds" in the sense of Theorem 5.6 for SVMs. Let us first ensure that
these classifiers are ERM-type algorithms that fit into the framework of
Theorem 5.6. To this end let $H$ be a RKHS of a continuous kernel over $X$,
$\lambda > 0$, and $l : Y \times \mathbb{R} \to [0, \infty)$ be the hinge loss function. We define

(40)    $L(f, x, y) := \lambda \|f\|_H^2 + l(y, f(x))$

and

(41)    $L(f, b, x, y) := \lambda \|f\|_H^2 + l(y, f(x) + b)$

for all $f \in H$, $b \in \mathbb{R}$, $x \in X$ and $y \in Y$. Then $\mathcal{R}_{L,T}(\cdot)$ and $\mathcal{R}_{L,T}(\cdot, \cdot)$ obviously
coincide with the objective functions of the SVM formulations and therefore
SVMs are empirical $L$-risk minimizers. Furthermore note that all above
minimizers exist (see [31]) and thus the SVM formulations in terms of $L$
actually fit into the framework of Theorem 5.6.

In the following, $f_{l,P}$ denotes a minimizer of $\mathcal{R}_{l,P}$ if no confusion can arise.
For the shape of these minimizers which depend on $\eta := P(y = 1 | \cdot)$ we refer
to [39] and [30]. Now our first result is a variance bound which can be used
when considering the empirical $l$-risk minimizer.
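
The empirical risk of the regularized loss in (40) and (41) can be written out for a function of the form $f = \sum_j c_j k_\sigma(x_j, \cdot)$. The following is a minimal sketch under stated assumptions: the Gaussian kernel is parametrized as $k_\sigma(x, x') = \exp(-\sigma^2 \|x - x'\|_2^2)$, the RKHS norm is computed from the Gram matrix as $\|f\|_H^2 = \sum_{i,j} c_i c_j k_\sigma(x_i, x_j)$, and all function and parameter names are illustrative rather than taken from the paper.

```python
import math

def gaussian_kernel(x, y, sigma):
    """k_sigma(x, y) = exp(-sigma^2 ||x - y||^2) (assumed parametrization)."""
    return math.exp(-sigma ** 2 * sum((a - b) ** 2 for a, b in zip(x, y)))

def hinge(y, t):
    """Hinge loss l(y, t) = max(0, 1 - y t)."""
    return max(0.0, 1.0 - y * t)

def empirical_L_risk(coeffs, centers, sample, lam, sigma, offset=0.0):
    """Empirical risk of L for f = sum_j coeffs[j] * k_sigma(centers[j], .).

    offset = 0 corresponds to (40); a nonzero offset b corresponds to (41).
    sample is a list of pairs (x, y) with y in {-1, +1}.
    """
    gram = [[gaussian_kernel(u, v, sigma) for v in centers] for u in centers]
    norm_sq = sum(ci * cj * gram[i][j]
                  for i, ci in enumerate(coeffs)
                  for j, cj in enumerate(coeffs))

    def f(x):
        return sum(c * gaussian_kernel(u, x, sigma) for c, u in zip(coeffs, centers))

    data_term = sum(hinge(y, f(x) + offset) for x, y in sample) / len(sample)
    return lam * norm_sq + data_term

sample = [((0.0,), 1), ((1.0,), -1)]
risk_of_zero = empirical_L_risk([0.0, 0.0], [(0.0,), (1.0,)], sample, lam=0.1, sigma=1.0)
risk_single = empirical_L_risk([1.0], [(0.0,)], [((0.0,), 1)], lam=0.5, sigma=1.0)
```

The zero function has risk 1, which gives the trivial bound $\lambda \|f_{T,\lambda}\|^2 \le 1$ used later; `risk_single` equals `lam` because the single training point is classified with margin exactly 1.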
Lemma 6.1. Let $P$ be a distribution on $X \times Y$ with Tsybakov noise
exponent $0 \le q \le \infty$. Then there exists a minimizer $f_{l,P}$ mapping into $[-1, 1]$
such that for all bounded measurable functions $f : X \to \mathbb{R}$ we have

$\mathbb{E}_P(l \circ f - l \circ f_{l,P})^2 \le C_{\eta,q} (\|f\|_\infty + 1)^{(q+2)/(q+1)} \bigl(\mathbb{E}_P(l \circ f - l \circ f_{l,P})\bigr)^{q/(q+1)},$

where $C_{\eta,q} := \|(2\eta - 1)^{-1}\|_{q,\infty} + 2$ if $q > 0$ and $C_{\eta,q} := 1$ if $q = 0$.

Proof. For $q = 0$ the assertion is trivial and hence we only consider
the case $q > 0$. Given a fixed $x \in X$ we write $p := P(1 | x)$ and $t := f(x)$. In
addition, we introduce

$v(p, t) := p \bigl(l(1, t) - l(1, f_{l,P}(x))\bigr)^2 + (1 - p) \bigl(l(-1, t) - l(-1, f_{l,P}(x))\bigr)^2,$

$m(p, t) := p \bigl(l(1, t) - l(1, f_{l,P}(x))\bigr) + (1 - p) \bigl(l(-1, t) - l(-1, f_{l,P}(x))\bigr).$

Since Tsybakov's noise assumption implies $P_X(X_0) = 0$, we can restrict our
consideration to $p \ne 1/2$. Now we will begin by showing

(42)    $v(p, t) \le \Bigl( |t| + \frac{2}{|2p - 1|} \Bigr) m(p, t).$

Without loss of generality we may assume $p > 1/2$. Then we may set $f_{l,P}(x) := 1$
and thus we have $l(1, f_{l,P}(x)) = 0$ and $l(-1, f_{l,P}(x)) = 2$.

Let us first consider the case $t \in [-1, 1]$. Then we have $l(1, t) = 1 - t$ and
$l(-1, t) = 1 + t$, and therefore (42) reduces to

$(1 - t)^2 \le \Bigl( |t| + \frac{2}{2p - 1} \Bigr) (2p - 1)(1 - t).$

Obviously, the latter inequality is equivalent to $1 - t \le (2p - 1)|t| + 2$, which
is always satisfied for $t \in [-1, 1]$ and $p > 1/2$.

Now let us consider the case $t \le -1$. We then have $l(1, t) = 1 - t$ and
$l(-1, t) = 0$, and after some elementary calculation we hence see that (42) is
satisfied if and only if

$p^2 (6 - 2t) - p (5 - 3t) - 2t \ge 0.$

The left-hand side is minimal if $p = (5 - 3t)/(12 - 4t)$, and thus we obtain

$p^2 (6 - 2t) - p (5 - 3t) - 2t \ge \frac{7t^2 - 18t - 25}{24 - 8t}.$

Consequently, it suffices to show $7t^2 - 18t - 25 \ge 0$. However, the latter is
true for all $t \le -1$ since $t \mapsto 7t^2 - 18t - 25$ is decreasing on $(-\infty, -1]$.

Now let us consider the third case, $t \ge 1$. Since we then have $l(1, t) = 0$
and $l(-1, t) = 1 + t$ it suffices to show

$t - 1 \le t + \frac{2}{2p - 1}.$

However, this is obviously true, and hence we have proved (42). Now, let us
write

$g(y, x) := l(y, f(x)) - l(y, f_{l,P}(x)),$
$h_1(x) := \eta(x) g(1, x) + (1 - \eta(x)) g(-1, x),$
$h_2(x) := \eta(x) g^2(1, x) + (1 - \eta(x)) g^2(-1, x).$

Then (42) yields $h_2(x) \le \bigl(\|f\|_\infty + \frac{2}{|2\eta(x) - 1|}\bigr) h_1(x)$ for all $x$ with $\eta(x) \ne 1/2$.
For $t \ge 1$ we hence find

$\mathbb{E}_P g^2 = \int_{|2\eta - 1|^{-1} < t} h_2\, dP_X + \int_{t \le |2\eta - 1|^{-1} < \infty} h_2\, dP_X$
    $\le (\|f\|_\infty + 2t) \int h_1\, dP_X + \int_{t \le |2\eta - 1|^{-1}} (\|f\|_\infty + 1)^2\, dP_X$
    $\le 2t (\|f\|_\infty + 1)\, \mathbb{E}_P g + (\|f\|_\infty + 1)^2\, P_X\bigl(|2\eta - 1|^{-1} \ge t\bigr)$
    $\le 2t (\|f\|_\infty + 1)\, \mathbb{E}_P g + (\|f\|_\infty + 1)^2\, \|(2\eta - 1)^{-1}\|_{q,\infty}\, t^{-q}.$

Let us define $t$ by $t^{q+1} := (\|f\|_\infty + 1)(\mathbb{E}_P g)^{-1}$. Since $\mathbb{E}_P g \le \|f\|_\infty + 1$ we
have $t \ge 1$ and hence the above estimate yields the assertion. □
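
The pointwise inequality (42), which carries the proof above, can be checked numerically on a grid. The sketch below is an illustration only; it assumes (42) in the form $v(p,t) \le (|t| + 2/|2p-1|)\, m(p,t)$ with the hinge loss and the pointwise minimizer $f_{l,P}(x) = \mathrm{sign}(2p - 1)$, and the function name is hypothetical.

```python
def hinge(y, t):
    """Hinge loss l(y, t) = max(0, 1 - y t)."""
    return max(0.0, 1.0 - y * t)

def check_42(p, t):
    """Return (v, bound) for inequality (42) at p != 1/2 and prediction t."""
    f_star = 1.0 if p > 0.5 else -1.0             # pointwise hinge minimizer
    d_pos = hinge(1, t) - hinge(1, f_star)         # loss difference for y = +1
    d_neg = hinge(-1, t) - hinge(-1, f_star)       # loss difference for y = -1
    v = p * d_pos ** 2 + (1 - p) * d_neg ** 2      # conditional second moment
    m = p * d_pos + (1 - p) * d_neg                # conditional mean
    return v, (abs(t) + 2.0 / abs(2 * p - 1)) * m

grid_p = [k / 100 for k in range(1, 100) if k != 50]
grid_t = [k / 10 for k in range(-50, 51)]
worst_gap = max(v - bound
                for p in grid_p for t in grid_t
                for v, bound in [check_42(p, t)])
```

On this grid `worst_gap` is zero up to rounding; equality occurs where the prediction coincides with the pointwise minimizer, so that both sides vanish.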

In the case of SVMs with offset we also need the following lemma which
bounds the size of the offset $b_{P,\lambda}$. This lemma has been proved in [15] for
empirical distributions. Although its generalization to general probability
measures is straightforward we include the proof for completeness.

Lemma 6.2. Let $P$ be a distribution on $X \times Y$ and $\lambda > 0$. Then for all
possible pairs $(f_{P,\lambda}, b_{P,\lambda}) \in H \times \mathbb{R}$ we have

$|b_{P,\lambda}| \le \|f_{P,\lambda}\|_\infty + 1.$

Proof. If $P(y = y^* | x) = 1$ $P_X$-a.s. for some $y^* \in Y$, there is nothing
to be proved since $b_{P,\lambda} = y^*$ by our assumption on SVMs mentioned in
Section 2. Now let us assume that $b_{P,\lambda} > \|f_{P,\lambda}\|_\infty + 1$ and that $P$ is not
degenerate in the above way. Then there exists a constant $\delta > 0$ such that
$b_{P,\lambda} \ge \|f_{P,\lambda}\|_\infty + 1 + \delta$. This implies $f_{P,\lambda}(x) + b_{P,\lambda} \ge 1 + \delta$ for all $x \in X$. We
define $b^*_{P,\lambda} := b_{P,\lambda} - \delta$. Obviously, we then find
$l(1, f_{P,\lambda}(x) + b_{P,\lambda}) = 0 = l(1, f_{P,\lambda}(x) + b^*_{P,\lambda})$ and

$l(-1, f_{P,\lambda}(x) + b_{P,\lambda}) = 1 + f_{P,\lambda}(x) + b^*_{P,\lambda} + \delta = l(-1, f_{P,\lambda}(x) + b^*_{P,\lambda}) + \delta$

for all $x \in X$. Therefore we obtain $\mathcal{R}_{l,P}(f_{P,\lambda} + b_{P,\lambda}) > \mathcal{R}_{l,P}(f_{P,\lambda} + b^*_{P,\lambda})$ by
using the assumption on $P$, which contradicts the minimality of $(f_{P,\lambda}, b_{P,\lambda})$.
The case $b_{P,\lambda} < -\|f_{P,\lambda}\|_\infty - 1$ is analogous. □

The proof of the above lemma can be easily generalized to a larger class of
loss functions including, for example, the squared hinge loss. With the help of
Lemma 6.1 we can now show a variance bound for SVMs. For brevity's sake
we only state and prove the result for SVMs without offset. Therefore, the
loss function $L$ is defined as in (40). Considering the proof, it is immediately
clear that the variance bound also holds for the SVM with offset.

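Proposition 6.3. Let $P$ be a distribution on $X \times Y$ with Tsybakov
noise exponent $0 \le q \le \infty$. We define $C := 16 + 8\|(2\eta - 1)^{-1}\|_{q,\infty}$ if $q > 0$
and $C := 8$ otherwise. Furthermore, let $\lambda > 0$ and $0 < \gamma \le \lambda^{-1/2}$ such that
$f_{P,\lambda} \in \gamma B_H$. Then for all $f \in \gamma B_H$ we have

$\mathbb{E}(L \circ f - L \circ f_{P,\lambda})^2 \le C (K\gamma + 1)^{(q+2)/(q+1)} \bigl(\mathbb{E}(L \circ f - L \circ f_{P,\lambda})\bigr)^{q/(q+1)}$
        $+ 2C (K\gamma + 1)^{(q+2)/(q+1)} a^{q/(q+1)}(\lambda).$

Proof. We define $\hat{C} := (K\gamma + 1)^{(q+2)/(q+1)}$ and fix an $f \in \gamma B_H$. Furthermore,
we choose a minimizer $f_{l,P}$ according to Lemma 6.1. Using $(a + b)^2 \le
2a^2 + 2b^2$ for all $a, b \in \mathbb{R}$ we first observe

$\mathbb{E}(L \circ f - L \circ f_{P,\lambda})^2$
    $\le 2\lambda^2 \|f\|^4 + 2\lambda^2 \|f_{P,\lambda}\|^4 + 2\, \mathbb{E}(l \circ f - l \circ f_{P,\lambda})^2$
    $\le 4\, \mathbb{E}(l \circ f - l \circ f_{l,P})^2 + 4\, \mathbb{E}(l \circ f_{l,P} - l \circ f_{P,\lambda})^2 + 2\lambda^2 \|f\|^4 + 2\lambda^2 \|f_{P,\lambda}\|^4$
    $\le 8 C_{\eta,q} \hat{C} \bigl(\mathbb{E}(l \circ f - l \circ f_{l,P}) + \mathbb{E}(l \circ f_{P,\lambda} - l \circ f_{l,P})\bigr)^{q/(q+1)} + 2\lambda^2 \|f\|^4 + 2\lambda^2 \|f_{P,\lambda}\|^4,$

where in the last step we have used Lemma 6.1 and $a^p + b^p \le 2(a + b)^p$ for
all $a, b \ge 0$, $0 < p \le 1$. Since $\lambda \|f\|^2 \le 1$ and $\lambda \|f_{P,\lambda}\|^2 \le 1$, we can continue,

$\mathbb{E}(L \circ f - L \circ f_{P,\lambda})^2$
    $\le C \hat{C} \bigl(\mathbb{E}(l \circ f - l \circ f_{l,P}) + \mathbb{E}(l \circ f_{P,\lambda} - l \circ f_{l,P}) + \lambda^2 \|f\|^4 + \lambda^2 \|f_{P,\lambda}\|^4\bigr)^{q/(q+1)}$
    $\le C \hat{C} \bigl(\mathbb{E}(L \circ f - L \circ f_{P,\lambda}) + 2\, \mathbb{E}(l \circ f_{P,\lambda} - l \circ f_{l,P}) + 2\lambda \|f_{P,\lambda}\|^2\bigr)^{q/(q+1)}$
    $\le C \hat{C} \bigl(\mathbb{E}(L \circ f - L \circ f_{P,\lambda})\bigr)^{q/(q+1)} + 2 C \hat{C}\, a^{q/(q+1)}(\lambda).$ □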

7. Proof of Theorem 2.8. In this last section we prove our main result,
Theorem 2.8. Since the proof is rather complex we split it into three parts.
In Section 7.1 we estimate some covering numbers related to SVMs and
Theorem 5.6. In Section 7.2 we then show that the trivial bound
$\|f_{T,\lambda}\| \le \lambda^{-1/2}$ can be significantly improved under the assumptions of Theorem 2.8.
Finally, in Section 7.3 we prove Theorem 2.8.

7.1. Covering numbers related to SVMs. In this subsection we establish
a simple lemma that estimates the covering numbers of the class $\mathcal{G}$ in
Theorem 5.6 in terms of the covering numbers of $B_H$. For brevity's sake it only
treats the case of SVMs with offset. The other case can be shown completely
analogously.

Lemma 7.1. Let $H$ be a RKHS over $X$ such that $K$ defined by (12)
satisfies $K > 1/2$, $P$ be a probability measure on $X \times Y$, $\lambda > 0$, and $L$ be
defined by (41). Furthermore, let $1 \le \gamma \le \lambda^{-1/2}$ and

$\mathcal{F} := \{(f, b) \in H \times \mathbb{R} : \|f\|_H \le \gamma \text{ and } |b| \le \gamma K + 1\}.$

Defining $B := 2\gamma K + 3$ and $\mathcal{G} := \{L \circ (f, b) - L \circ (f_{P,\mathcal{F}}, b_{P,\mathcal{F}}) : (f, b) \in \mathcal{F}\}$ then
gives $\|g\|_\infty \le B$ for all $g \in \mathcal{G}$, where $(f_{P,\mathcal{F}}, b_{P,\mathcal{F}})$ denotes an $L$-risk minimizer
in $\mathcal{F}$. Assume that there are constants $a \ge 1$ and $0 < p < 2$ such that for all
$\varepsilon > 0$ we have

$\sup_{T \in Z^n} \log \mathcal{N}(B_H, \varepsilon, L_2(T_X)) \le a \varepsilon^{-p}.$

Then there exists a constant $c_p > 0$ depending only on $p$ such that for all
$\varepsilon > 0$ we have

$\sup_{T \in Z^n} \log \mathcal{N}(B^{-1}\mathcal{G}, \varepsilon, L_2(T)) \le c_p\, a\, \varepsilon^{-p}.$

Proof. Let us write $\hat{\mathcal{G}} := \{L \circ (f, b) : (f, b) \in \mathcal{F}\}$ and
$\mathcal{H} := \{l \circ (f + b) : (f, b) \in \mathcal{F}\}$. Furthermore, for brevity's sake we denote the set of all
constant functions from $X$ to $[a, b]$ by $[a, b]$. We then have

M{B~^g,e,L2{T)) =J\f{B~^g,e,L2{T))<J\f{[0,X-f^]+B-^n,e,L2{T)). 

Using the Lipschitz-continuity of the hinge loss and the subadditivity of the 
log-covering numbers we hence find 

logM{B-^g,3e,L2{T)) 

< log AA([0, Xj%£, R) + logM{B~^n, 2e, L2{T)) 

< logQ + l) + logAA(B-i(7 • Bh + [-B,B]),2e, L2{Tx)) 

<21ogQ + l) +logAA(5H,e,^2(Tx)). 
From this we easily deduce the assertion. □ 
7.2. Shrinking the size of the SVM minimizers. In this subsection we
show that the trivial bound $\|f_{T,\lambda}\| \le \lambda^{-1/2}$ can be significantly improved
under the assumptions of Theorem 2.8. In view of Theorem 5.6 this
improvement will have a substantial impact on the rates of Theorem 2.8. In
order to obtain a rather flexible result let us suppose that for all $0 < p < 2$
we can determine constants $c, \gamma > 0$ such that

(43)    $\sup_{T \in Z^n} \log \mathcal{N}(B_{H_\sigma}, \varepsilon, L_2(T_X)) \le c\, \sigma^{\gamma d}\, \varepsilon^{-p}$

holds for all $\varepsilon > 0$, $\sigma \ge 1$. Recall that by Theorem 2.1 we can actually choose
$\gamma := (1 - \frac{p}{2})(1 + \delta)$ for all $\delta > 0$.

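Lemma 7.2. Let $X$ be the closed unit ball of the Euclidean space $\mathbb{R}^d$,
and $P$ be a distribution on $X \times Y$ with Tsybakov noise exponent $0 \le q \le \infty$
and geometric noise exponent $0 < \alpha < \infty$. Furthermore, let us assume that
(43) is satisfied for some $0 < \gamma \le 2$ and $0 < p < 2$. Given an $0 \le \varsigma < \frac15$ we
define

$\lambda_n := n^{-\frac{4(\alpha+1)(q+1)}{(2\alpha+1)(2q+pq+4) + 4\gamma(q+1)} \cdot \frac{1}{1-\varsigma}}$

and $\sigma_n := \lambda_n^{-1/((\alpha+1)d)}$. Assume that for the SVM without offset using the
Gaussian RBF kernel with width $\sigma_n$ there are constants $\frac{1}{2(\alpha+1)} + 4\varsigma < \rho \le \frac12$
and $C \ge 1$ such that

$\Pr{}^*\bigl(T \in (X \times Y)^n : \|f_{T,\lambda_n}\| \le C x \lambda_n^{-\rho}\bigr) \ge 1 - e^{-x}$

for all $n \ge 1$ and all $x \ge 1$. Then there is another constant $\hat{C} \ge 1$ such that
for $\hat\rho := \frac12 \bigl(\frac{1}{2(\alpha+1)} + 4\varsigma + \rho\bigr)$ and for all $n \ge 1$, $x \ge 1$ we have

$\Pr{}^*\bigl(T \in (X \times Y)^n : \|f_{T,\lambda_n}\| \le \hat{C} x \lambda_n^{-\hat\rho}\bigr) \ge 1 - e^{-x}.$

Moreover, the same result is true for SVMs with offset.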

Proof. We only prove the lemma for SVMs without offset since the
proof for SVMs with offset is analogous. Now let $\hat{f}_{T,\lambda_n}$ be a minimizer of
$\mathcal{R}_{L,T}$ on $C x \lambda_n^{-\rho} B_{H_{\sigma_n}}$, where $L$ is defined by (40). By our assumption
we have $\hat{f}_{T,\lambda_n} = f_{T,\lambda_n}$ with probability not less than $1 - e^{-x}$ since $f_{T,\lambda_n}$ is
unique for every training set $T$ by the strict convexity of $L$. We show that
for some constant $\tilde{C} > 0$ and all $n \ge 1$, $x \ge 1$ the improved bound

(44)    $\|\hat{f}_{T,\lambda_n}\| \le \tilde{C} x \lambda_n^{-\hat\rho}$

holds with probability not less than $1 - e^{-x}$. This then yields
$\|f_{T,\lambda_n}\| \le \tilde{C} x \lambda_n^{-\hat\rho}$ with probability not less than $1 - 2e^{-x}$, and from the latter we
easily obtain the assertion. In order to establish (44) we will apply
Theorem 5.6 to the modified SVM classifier which produces $\hat{f}_{T,\lambda_n}$. To this end
we first remark that the infinite sample version $\hat{f}_{P,\lambda_n}$ which minimizes $\mathcal{R}_{L,P}$
on $C x \lambda_n^{-\rho} B_{H_{\sigma_n}}$ exists by a small modification of [31], Lemma 3.1.
Furthermore, by Proposition 6.3 and assumption (43) we observe that we may
choose $B$, $a$ and $c$ such that

$B \sim x \lambda_n^{-\rho}, \qquad a \sim \lambda_n^{-\gamma/(\alpha+1)} \qquad \text{and} \qquad c \sim x^{(q+2)/(q+1)} \lambda_n^{-\rho(q+2)/(q+1)}.$

In addition, Theorem 2.7 shows $a_{\sigma_n}(\lambda_n) \preceq \lambda_n^{\alpha/(\alpha+1)}$ and thus by
Proposition 6.3 we may choose

$\delta \sim x^{(q+2)/(q+1)} \lambda_n^{(\alpha q - \rho(q+2)(\alpha+1))/((\alpha+1)(q+1))}.$

A rather time consuming but simple calculation then shows that the e-term 
in Theorem 5.6 satisfies 

_a 2p(a + l)-l t2a + l){2q+pq+i) + A-i{q+l) 

e{n,a,B,c,5,x)<x^K+' 2(.+i)(2,+p,+4) ^ 

Moreover, by Theorem 5.6 there is a constant Ci > independent of n and 
X such that for all n > 1 and all x > 1 the estimate 

An||/T,A„f < An||/T,A„f +^i,p(/T,Aj -^Z,P 

< A„||/p,A„ \? + T^i,p{fp,\n) - T^i,P + Cix^e{n, a, B, c, 5, x) 
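holds with probability not less than $1 - e^{-x}$. Now
$\lambda_n \|f_{P,\lambda_n}\|^2 \le a_{\sigma_n}(\lambda_n) \preceq \lambda_n^{\alpha/(\alpha+1)}$ yields
$\|f_{P,\lambda_n}\| \preceq \lambda_n^{-1/(2(\alpha+1))}$ and hence $\rho > \frac{1}{2(\alpha+1)}$ implies
$\|f_{P,\lambda_n}\| \le C x \lambda_n^{-\rho}$ for large $n$. In other words, for large $n$ we have $\hat{f}_{P,\lambda_n} = f_{P,\lambda_n}$.
Consequently, with probability not less than $1 - e^{-x}$ we have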

An||/T,A„ f < KWfpM f + T^iAfPM) - '^i,P + Cix^e{n, a, B, c, 6, x) 
which shows the assertion. □ 



7.3. Proof of Theorem 2.8. The next theorem almost establishes the re- 
sult of Theorem 2.8. We present this intermediate result because it clarifies 
the impact of covering number bounds of the form (43) on our rates. 

Theorem 7.3. Let $X$ be the closed unit ball of the Euclidean space $\mathbb{R}^d$,
and $P$ be a distribution on $X \times Y$ with Tsybakov noise exponent $0 \le q \le \infty$
and geometric noise exponent $0 < \alpha < \infty$. Finally, let us assume that we can
bound the covering numbers by (43) for some $0 < \gamma \le 2$ and $0 < p < 2$. Given
an $0 \le \varsigma < \frac15$ we define $\lambda_n$ and $\sigma_n$ as in Lemma 7.2. Then for all $\varepsilon > 0$ there
is a constant $C > 0$ such that for all $x \ge 1$ and all $n \ge 1$ the SVM without
offset and with regularization parameter $\lambda_n$ and Gaussian RBF kernel with
width $\sigma_n$ satisfies

$\Pr{}^*\Bigl(T : \mathcal{R}_P(f_{T,\lambda_n}) \le \mathcal{R}_P + C x^2\, n^{-\frac{4\alpha(q+1)}{(2\alpha+1)(2q+pq+4) + 4\gamma(q+1)} \cdot \frac{1}{1-\varsigma} + 20\varsigma + \varepsilon}\Bigr) \ge 1 - e^{-x}.$

Moreover, the same result is true for SVMs with offset.

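Proof. Iteratively using Lemma 7.2 we find a constant $C \ge 1$ such that
for $\rho := \frac{1}{2(\alpha+1)} + 4\varsigma + \varepsilon$ and all
$n \ge 1$, $x \ge 1$ we have

\[
\Pr\nolimits^*\bigl(T \in (X \times Y)^n :
  \|f_{T,\lambda_n}\| \le C x \lambda_n^{-\rho}\bigr) \ge 1 - e^{-x}.
\]

Repeating the calculations of Lemma 7.2 we hence find a constant $C > 0$
such that for all $n \ge 1$ and all $x \ge 1$ we have

\[
\lambda_n \|f_{T,\lambda_n}\|^2
  + \mathcal{R}_{l,P}(f_{T,\lambda_n}) - \mathcal{R}_{l,P}
\le \lambda_n \|f_{P,\lambda_n}\|^2
  + \mathcal{R}_{l,P}(f_{P,\lambda_n}) - \mathcal{R}_{l,P}
\]

with probability not less than $1 - e^{-x}$. By the definition of $\rho$
we obtain

\[
\begin{aligned}
\lambda_n^{\frac{\alpha}{\alpha+1} - \frac{2\rho(\alpha+1)-1}{2(\alpha+1)} - 4\varsigma}
  &= \lambda_n^{\frac{\alpha}{\alpha+1} - 4\varsigma - \varepsilon - 4\varsigma} \\
  &\le n^{-\frac{4\alpha(q+1)}{(2\alpha+1)(2q+pq+4)+4\gamma(q+1)}
      \cdot\frac{1}{1-\varsigma} + 20\varsigma + 3\varepsilon}.
\end{aligned}
\]

From this we easily deduce the assertion. □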
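
The last estimate can be spelled out as follows. This is only a sketch
under two assumptions that are not restated in this section: that Lemma 7.2
defines $\lambda_n := n^{-\frac{4(\alpha+1)(q+1)}{D}\cdot\frac{1}{1-\varsigma}}$
with $D := (2\alpha+1)(2q+pq+4)+4\gamma(q+1)$ (the shorthand $D$ is ours,
not the paper's), and that $\varsigma < \frac{1}{5}$ and $q < \infty$.
Starting from the exponent $\frac{\alpha}{\alpha+1} - 4\varsigma - \varepsilon - 4\varsigma$
of the preceding display:

```latex
\begin{align*}
\lambda_n^{\frac{\alpha}{\alpha+1}-8\varsigma-\varepsilon}
  &= n^{-\frac{4\alpha(q+1)}{D}\cdot\frac{1}{1-\varsigma}
        +(8\varsigma+\varepsilon)\,\frac{4(\alpha+1)(q+1)}{D(1-\varsigma)}},\\
D &\ge (2\alpha+1)(2q+4) \ge 2(\alpha+1)(q+1)
  \quad\Longrightarrow\quad \frac{4(\alpha+1)(q+1)}{D}\le 2,\\
\frac{1}{1-\varsigma} &\le \frac{5}{4}
  \quad\Longrightarrow\quad
  (8\varsigma+\varepsilon)\,\frac{4(\alpha+1)(q+1)}{D(1-\varsigma)}
  \le \tfrac{5}{2}\,(8\varsigma+\varepsilon)
  \le 20\varsigma+3\varepsilon.
\end{align*}
```

Under these assumptions the constant $20$ in front of $\varsigma$ is
exactly $8 \cdot \frac{5}{2}$, which would explain the restriction
$\varsigma < \frac{1}{5}$.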

In order to prove Theorem 2.8 recall that by Theorem 2.1 we can choose
$\gamma := (1 - \frac{p}{2})(1 + \delta)$ for all $\delta > 0$. The idea
of the proof of Theorem 2.8 is to let $\delta \to 0$ while simultaneously
adjusting $\varsigma$. The resulting rate is then optimized with respect
to $p$. Unfortunately, a rigorous proof requires $p$ to be chosen a
priori. Therefore, the optimization step is somewhat hidden in the
following proof.
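
The hidden optimization over $p$ can be illustrated numerically. The sketch
below is not part of the paper's argument: it assumes the limiting values
$\delta = 0$ and $\varsigma = 0$ (which the proof only approaches), takes
the rate exponent of Theorem 7.3 with $\gamma = 1 - p/2$, and scans $p$
over a grid in $[0, 2]$. The names `rate_exponent` and `best_p` are
illustrative and do not appear in the paper.

```python
from fractions import Fraction


def rate_exponent(alpha, q, p):
    """Exponent 4a(q+1) / ((2a+1)(2q+pq+4) + 4*gamma*(q+1)) of Theorem 7.3,
    evaluated in the assumed limit delta = 0, varsigma = 0, gamma = 1 - p/2."""
    gamma = 1 - Fraction(p) / 2
    denom = (2 * alpha + 1) * (2 * q + p * q + 4) + 4 * gamma * (q + 1)
    return 4 * alpha * (q + 1) / denom


def best_p(alpha, q, grid=41):
    """Grid point p in [0, 2] that maximizes the exponent (fastest rate)."""
    ps = [Fraction(2 * i, grid - 1) for i in range(grid)]
    return max(ps, key=lambda p: rate_exponent(alpha, q, p))


if __name__ == "__main__":
    # Sample point with 2*alpha*q - q - 2 <= 0.
    print(best_p(Fraction(1, 2), 1), rate_exponent(Fraction(1, 2), 1, 2))
    # Sample point with 2*alpha*q - q - 2 > 0.
    print(best_p(3, 1), rate_exponent(3, 1, 0))
```

The denominator is affine in $p$ with slope $2\alpha q - q - 2$, so the
maximizer sits at an endpoint of $[0, 2]$: at $p = 2$ when this slope is
nonpositive and at $p = 0$ otherwise. This is consistent with the choices
$p_\delta = 2 - \delta$ and $p_\delta = \delta$ made in the two cases of
the proof.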

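Proof of Theorem 2.8. Let us first consider the case
$\alpha \le \frac{q+2}{2q}$. Our aim is to apply Theorem 7.3. To this end
we write $p_\delta := 2 - \delta$ and
$\gamma_\delta := (1 - \frac{p_\delta}{2})(1 + \delta) = \frac{\delta}{2}(1 + \delta)$
for $\delta > 0$. Furthermore, we define $\varsigma_\delta$ by

\[
\frac{4(\alpha+1)(q+1)}{(2\alpha+1)(4q - \delta q + 4) + 4\gamma_\delta(q+1)}
  \cdot \frac{1}{1-\varsigma_\delta} = \frac{\alpha+1}{2\alpha+1}.
\]

Since $2\alpha q - q - 2 \le 0 < 2\delta(q+1)$ we have
$q(2\alpha+1) < 2(1+\delta)(q+1)$ and hence

\[
4(2\alpha+1)(q+1) < 4(2\alpha+1)(q+1) - \delta q(2\alpha+1)
  + 2\delta(1+\delta)(q+1).
\]

This shows $\varsigma_\delta > 0$ for all $\delta > 0$. Furthermore, these
definitions also imply $\varsigma_\delta \to 0$ and $\gamma_\delta \to 0$
whenever $\delta \to 0$. Now Theorem 7.3 tells us that for all
$\varepsilon > 0$ and all small enough $\delta > 0$ there exists a constant
$C_{\delta,\varepsilon} \ge 1$ such that for all $n \ge 1$, $x \ge 1$ we
have

\[
\Pr\nolimits^*\Bigl(\mathcal{R}_P(f_{T,\lambda_n}) \le \mathcal{R}_P
  + C_{\delta,\varepsilon}\, x^2
    n^{-\frac{4\alpha(q+1)}{(2\alpha+1)(4q-\delta q+4)+4\gamma_\delta(q+1)}
    \cdot\frac{1}{1-\varsigma_\delta} + 20\varsigma_\delta + \varepsilon}\Bigr)
  \ge 1 - e^{-x}.
\]

In particular, if we choose $\delta$ sufficiently small we obtain the
assertion.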
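Let us now consider the case $\frac{q+2}{2q} < \alpha < \infty$. In this
case we write $p_\delta := \delta$ and
$\gamma_\delta := (1 - \frac{p_\delta}{2})(1 + \delta) = 1 + \frac{\delta}{2} - \frac{\delta^2}{2}$
for $\delta > 0$. Furthermore, we define $\varsigma_\delta$ by

\[
\frac{4(\alpha+1)(q+1)}{(2\alpha+1)(2q + \delta q + 4) + 4\gamma_\delta(q+1)}
  \cdot \frac{1}{1-\varsigma_\delta}
  = \frac{2(\alpha+1)(q+1)}{2\alpha(q+2) + 3q + 4}.
\]

Since for $0 < \delta < 1$ we have
$0 < \delta q(2\alpha+1) + 2\delta(q+1) - 2\delta^2(q+1)$ we easily check
that $\varsigma_\delta > 0$. Furthermore, the definitions ensure
$\varsigma_\delta \to 0$ and $\gamma_\delta \to 1$ whenever $\delta \to 0$.
The rest of the proof follows that of the first case. Finally, let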
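us treat the case $\alpha = \infty$. We define $\alpha_\lambda$ by
$\log \lambda = \alpha_\lambda d \log \frac{2\sqrt{d}}{\sigma}$. Since
$\sigma > 2\sqrt{d}$ we have $\alpha_\lambda > 0$ for all $0 < \lambda < 1$.
Furthermore, applying Theorem 2.7 for $\alpha_\lambda$ we find
$a(\lambda) \le 2C\lambda$ for all $0 < \lambda < 1$ and a constant
$C > 0$ depending only on the dimension $d$. Adapted versions of Lemma 7.2
and Theorem 7.3 then yield the assertion. □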
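
As a sanity check on the two definitions of $\varsigma_\delta$, the
following sketch solves each defining equation for $\varsigma_\delta$ in
exact rational arithmetic. It assumes that the first case is characterized
by $2\alpha q - q - 2 \le 0$ and that $q$ is finite; the function name
`varsigma_delta` is ours, not the paper's.

```python
from fractions import Fraction


def varsigma_delta(alpha, q, delta):
    """Solve base / (1 - s) = target for s, following the two cases of the proof.

    Assumption: the first case is 2*alpha*q - q - 2 <= 0, the second is its
    complement with finite alpha.
    """
    first_case = 2 * alpha * q - q - 2 <= 0
    p = 2 - delta if first_case else delta          # p_delta
    gamma = (1 - Fraction(p) / 2) * (1 + delta)     # gamma_delta
    denom = (2 * alpha + 1) * (2 * q + p * q + 4) + 4 * gamma * (q + 1)
    base = Fraction(4 * (alpha + 1) * (q + 1)) / denom
    if first_case:
        target = Fraction(alpha + 1) / (2 * alpha + 1)
    else:
        target = Fraction(2 * (alpha + 1) * (q + 1)) / (2 * alpha * (q + 2) + 3 * q + 4)
    return 1 - base / target


if __name__ == "__main__":
    for alpha, q in [(Fraction(1, 2), 1), (3, 1)]:
        for delta in [Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)]:
            s = varsigma_delta(alpha, q, delta)
            print(alpha, q, delta, float(s))
```

For both sample points the computed values are positive, stay below
$\frac{1}{5}$, and shrink as $\delta$ decreases, in line with the claims
$\varsigma_\delta > 0$ and $\varsigma_\delta \to 0$ used in the proof.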